pysparksparkpandasmigrationpandas-on-sparkdata-engineering

From pandas to PySpark — Scaling Out Beyond a Single Machine

A migration guide for pandas users who hit data that no longer fits in memory. We cover the lazy vs eager execution model difference, the common traps (loops, indexes, collect), and how to run your code almost unchanged at scale with pandas API on Spark.

Data DynamicsJune 5, 20267 min read

Analysis code that ran fine in pandas starts throwing MemoryError as the data grows. The moment your data no longer fits in a single machine's memory, you start thinking about moving to PySpark. But pandas and Spark have fundamentally different execution models — carry your familiar pandas habits over as-is and things get slow, or blow up memory even harder.

This post covers the differences pandas users absolutely must understand when moving to PySpark, the traps people commonly fall into, and the path to distributed processing with almost no code changes (pandas API on Spark).

1. The Biggest Difference — Lazy vs Eager

pandas executes eagerly. Write a line and it computes on the spot, holding the result in memory. Spark executes lazily. It just accumulates transformations, and only when it hits an action does it optimize and execute everything at once.

# pandas: each line executes immediately
df2 = df[df.amount > 100]      # filtered right away
df3 = df2.groupby("user").sum() # aggregated right away
 
# PySpark: transformations only build a plan; actions execute it
df2 = df.filter(F.col("amount") > 100)    # not executed yet
df3 = df2.groupBy("user").sum()            # still just a plan
df3.show()                                  # <- only here does everything run

	pandas (eager)	PySpark (lazy)
When it runs	immediately, per line	all at once, at an action
Optimization	none	Catalyst optimizes the whole plan
Debugging	inspect intermediate results instantly	nothing to inspect before an action

Once you understand this difference, you understand "why the error shows up on the wrong line" (in reality, everything executes — and blows up — at the action).

2. Trap 1: Loops — Never Iterate Over Rows

The pandas habit of row iteration with iterrows or apply is the worst thing you can do in Spark.

# pandas (already slow, but it works)
for idx, row in df.iterrows():
    df.at[idx, "grade"] = compute(row)
 
# Mimic this in PySpark? -> collect blows up driver memory + kills the distribution
# The right way: column operations (vectorized)
df = df.withColumn("grade",
    F.when(F.col("score") >= 90, "A").otherwise("B"))

Spark's power lies in distributed column-level operations. The moment you iterate over rows, the benefit of distribution disappears. The same goes for pandas' apply(axis=1) — replace it with column expressions or (if unavoidable) pandas_udf (see the separate post "Why PySpark UDFs Are Slow, and Pandas UDFs").

3. Trap 2: collect / toPandas — Pulling Everything to the Driver

# Dangerous: distributed data into a single driver's memory -> OOM on large data
result = spark_df.toPandas()
rows = spark_df.collect()
 
# Safe: write distributed, or only after shrinking the data
spark_df.write.parquet("out")              # executors write in parallel
sample = spark_df.limit(1000).toPandas()   # only as much as you need
agg = spark_df.groupBy("k").count().toPandas()  # only a small aggregated result

The urge to "come back" to pandas is the most dangerous one. If the result is large, never collect it to the driver (see the driver OOM section in the separate post "Conquering PySpark Executor OOM").

4. Trap 3: Indexes — Spark Has No Row Index

The index, central to pandas, doesn't exist in Spark. Order- and position-based access (df.iloc[5], df.loc[idx]) doesn't work.

# pandas: positional/index access
df.iloc[0]
df.set_index("id")
 
# PySpark: if you need order, sort explicitly + use a window
from pyspark.sql.window import Window
df = df.withColumn("row_num",
    F.row_number().over(Window.orderBy("created_at")))

Spark data is distributed and has no order. If you need order, sort explicitly; if you need row numbers, use a window function.

5. Trap 4: Data Types and NULL

# pandas: NaN, dynamic types
# PySpark: explicit schema, null
 
# pandas object columns with mixed types -> Spark schemas are strict
# When reading CSV, don't rely on schema inference; declare it
schema = "id long, amount double, name string"
df = spark.read.schema(schema).csv("path", header=True)

6. API Mapping — The Operations You Know by Heart

Task	pandas	PySpark
Filter	`df[df.x > 1]`	`df.filter(F.col("x") > 1)`
Add column	`df["y"] = ...`	`df.withColumn("y", ...)`
Aggregate	`df.groupby("k").sum()`	`df.groupBy("k").sum()`
Join	`pd.merge(a, b, on="k")`	`a.join(b, "k")`
Sort	`df.sort_values("x")`	`df.orderBy("x")`
Rename column	`df.rename(...)`	`df.withColumnRenamed(...)`
Handle missing values	`df.fillna(0)`	`df.fillna(0)` / `F.coalesce`
Unique values	`df.x.unique()`	`df.select("x").distinct()`
Row count	`len(df)`	`df.count()` (an action!)

7. The Easiest Path — pandas API on Spark

If you want distributed processing with almost no code changes, use pandas API on Spark (formerly Koalas, pyspark.pandas). It runs a nearly identical pandas API on top of Spark.

import pyspark.pandas as ps
 
# almost identical to pandas code!
psdf = ps.read_parquet("hdfs://.../big_data")
result = (psdf[psdf.amount > 100]
    .groupby("user")["amount"]
    .sum()
    .sort_values(ascending=False))
 
# convert only the small result that truly needs real pandas
small = result.head(100).to_pandas()

	Pure PySpark	pandas API on Spark
Code changes	many (different API)	almost none
Familiarity	new API to learn	pandas as-is
Fine-grained control	high	somewhat abstracted
Best for	new code / performance-critical	porting pandas code

Migration strategy: if you have a lot of existing pandas code, the realistic approach is incremental — port it to pandas API on Spark first to get it working, then rewrite only the performance-critical hot spots in pure PySpark.

8. Spark Is Overkill for Small Data

There's a trap in the opposite direction too. If your data is small, Spark is actually slower — because of shuffle, JVM, and scheduling overhead.

Data scale	Recommendation
A few GB or less, fits on one machine	pandas (or Polars/DuckDB)
Exceeds single-machine memory	PySpark
Lots of pandas code, data growing	pandas API on Spark

"Big data = Spark" isn't always right. If it fits on a single machine, pandas/Polars/DuckDB are faster and simpler.

9. Migration Checklist

Understand lazy execution — nothing runs before an action
iterrows/apply(axis=1) → column operations/pandas_udf
No toPandas/collect — write distributed or shrink first
Remove index dependencies → sort + window
Declare schemas explicitly (reduce reliance on inference)
Port large pandas codebases to pandas API on Spark
Don't bother with Spark for small data

10. Wrap-up

Difference/trap	pandas habit	PySpark answer
Execution	eager	lazy (runs at an action)
Row processing	iterrows/apply	vectorized column operations
Collecting results	everything in memory	distributed writes/limit
Order	index	sort + window
Porting	—	pandas API on Spark

Moving from pandas to PySpark isn't "swapping APIs" — it's "changing how you think." pandas' premises — eager execution, row iteration, holding everything in memory — are all inverted in a distributed environment. Understand lazy execution and column operations, resist the temptation of toPandas, and migrate existing code incrementally via pandas API on Spark — that's the smoothest way past the limits of a single machine.

This article is based on Spark 3.5. If you need help scaling out pandas-based analytics or migrating data pipelines, feel free to reach out anytime.

— Data Dynamics Engineering Team