From pandas to PySpark — Scaling Out Beyond a Single Machine

A migration guide for pandas users whose data no longer fits in memory. We cover the difference between eager and lazy execution, the common traps (loops, indexes, collect), and how to run your code almost unchanged at scale with pandas API on Spark.

Data Dynamics · June 5, 2026 · 7 min read

Analysis code that ran fine in pandas starts throwing MemoryError as the data grows. Once your data no longer fits in a single machine's memory, PySpark becomes the obvious candidate. But pandas and Spark have fundamentally different execution models: carry your pandas habits over unchanged and the code either runs slowly or runs out of memory even faster.

This post covers the differences pandas users absolutely must understand when moving to PySpark, the traps people commonly fall into, and the path to distributed processing with almost no code changes (pandas API on Spark).

1. The Biggest Difference — Lazy vs Eager

pandas executes eagerly. Write a line and it computes on the spot, holding the result in memory. Spark executes lazily. It just accumulates transformations, and only when it hits an action does it optimize and execute everything at once.

# pandas: each line executes immediately
df2 = df[df.amount > 100]       # filtered right away
df3 = df2.groupby("user").sum() # aggregated right away

# PySpark: transformations only build a plan; actions execute it
from pyspark.sql import functions as F

df2 = df.filter(F.col("amount") > 100)    # not executed yet
df3 = df2.groupBy("user").sum()            # still just a plan
df3.show()                                  # <- only here does everything run

| | pandas (eager) | PySpark (lazy) |
|---|---|---|
| When it runs | immediately, per line | all at once, at an action |
| Optimization | none | Catalyst optimizes the whole plan |
| Debugging | inspect intermediate results instantly | nothing to inspect before an action |

Once you understand this difference, you understand "why the error shows up on the wrong line" (in reality, everything executes — and blows up — at the action).

2. Trap 1: Loops — Never Iterate Over Rows

The pandas habit of iterating over rows with iterrows or apply(axis=1) is the worst thing you can do in Spark.

# pandas (already slow, but it works)
for idx, row in df.iterrows():
    df.at[idx, "grade"] = compute(row)
 
# Mimic this in PySpark? -> collect() pulls every row into driver memory and throws away the parallelism
# The right way: column expressions (vectorized)
df = df.withColumn("grade",
    F.when(F.col("score") >= 90, "A").otherwise("B"))

Spark's power lies in distributed column-level operations. The moment you iterate over rows, the benefit of distribution disappears. The same goes for pandas' apply(axis=1) — replace it with column expressions or (if unavoidable) pandas_udf (see the separate post "Why PySpark UDFs Are Slow, and Pandas UDFs").

3. Trap 2: collect / toPandas — Pulling Everything to the Driver

# Dangerous: distributed data into a single driver's memory -> OOM on large data
result = spark_df.toPandas()
rows = spark_df.collect()
 
# Safe: write distributed, or only after shrinking the data
spark_df.write.parquet("out")              # executors write in parallel
sample = spark_df.limit(1000).toPandas()   # only as much as you need
agg = spark_df.groupBy("k").count().toPandas()  # only a small aggregated result

The urge to "come back" to pandas is the most dangerous one. If the result is large, never collect it to the driver (see the driver OOM section in the separate post "Conquering PySpark Executor OOM").

4. Trap 3: Indexes — Spark Has No Row Index

The index, central to pandas, doesn't exist in Spark. Order- and position-based access (df.iloc[5], df.loc[idx]) doesn't work.

# pandas: positional/index access
df.iloc[0]
df.set_index("id")
 
# PySpark: if you need order, sort explicitly + use a window
from pyspark.sql.window import Window
df = df.withColumn("row_num",
    F.row_number().over(Window.orderBy("created_at")))

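Spark data is distributed across partitions and has no guaranteed row order. If you need order, sort explicitly; if you need row numbers, use a window function. Note that a window with orderBy but no partitionBy, like the one above, moves all rows into a single partition, so add a partitionBy key whenever the numbering only needs to be per group.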

5. Trap 4: Data Types and NULL

# pandas: NaN, dynamic types
# PySpark: explicit schema, null
 
# pandas object columns with mixed types -> Spark schemas are strict
# When reading CSV, don't rely on schema inference; declare it
schema = "id long, amount double, name string"
df = spark.read.schema(schema).csv("path", header=True)

6. API Mapping — The Operations You Know by Heart

| Task | pandas | PySpark |
|---|---|---|
| Filter | df[df.x > 1] | df.filter(F.col("x") > 1) |
| Add column | df["y"] = ... | df.withColumn("y", ...) |
| Aggregate | df.groupby("k").sum() | df.groupBy("k").sum() |
| Join | pd.merge(a, b, on="k") | a.join(b, "k") |
| Sort | df.sort_values("x") | df.orderBy("x") |
| Rename column | df.rename(...) | df.withColumnRenamed(...) |
| Handle missing values | df.fillna(0) | df.fillna(0) / F.coalesce |
| Unique values | df.x.unique() | df.select("x").distinct() |
| Row count | len(df) | df.count() (an action!) |

7. The Easiest Path — pandas API on Spark

If you want distributed processing with almost no code changes, use pandas API on Spark (formerly the Koalas project, shipped with Spark as pyspark.pandas since Spark 3.2). It provides a nearly identical pandas API on top of Spark.

import pyspark.pandas as ps
 
# almost identical to pandas code!
psdf = ps.read_parquet("hdfs://.../big_data")
result = (psdf[psdf.amount > 100]
    .groupby("user")["amount"]
    .sum()
    .sort_values(ascending=False))
 
# convert only the small result that truly needs real pandas
small = result.head(100).to_pandas()

| | Pure PySpark | pandas API on Spark |
|---|---|---|
| Code changes | many (different API) | almost none |
| Familiarity | new API to learn | pandas as-is |
| Fine-grained control | high | somewhat abstracted |
| Best for | new code / performance-critical | porting pandas code |

Migration strategy: if you have a lot of existing pandas code, the realistic approach is incremental — port it to pandas API on Spark first to get it working, then rewrite only the performance-critical hot spots in pure PySpark.

8. Spark Is Overkill for Small Data

There's a trap in the opposite direction too. If your data is small, Spark is actually slower — because of shuffle, JVM, and scheduling overhead.

| Data scale | Recommendation |
|---|---|
| A few GB or less, fits on one machine | pandas (or Polars/DuckDB) |
| Exceeds single-machine memory | PySpark |
| Lots of pandas code, data growing | pandas API on Spark |

"Big data = Spark" isn't always right. If it fits on a single machine, pandas/Polars/DuckDB are faster and simpler.

9. Migration Checklist

  • Understand lazy execution — nothing runs before an action
  • iterrows/apply(axis=1) → column operations/pandas_udf
  • No toPandas/collect — write distributed or shrink first
  • Remove index dependencies → sort + window
  • Declare schemas explicitly (reduce reliance on inference)
  • Port large pandas codebases to pandas API on Spark
  • Don't bother with Spark for small data

10. Wrap-up

| Difference/trap | pandas habit | PySpark answer |
|---|---|---|
| Execution | eager | lazy (runs at an action) |
| Row processing | iterrows/apply | vectorized column operations |
| Collecting results | everything in memory | distributed writes/limit |
| Order | index | sort + window |
| Porting | | pandas API on Spark |

Moving from pandas to PySpark isn't "swapping APIs" — it's "changing how you think." pandas' premises — eager execution, row iteration, holding everything in memory — are all inverted in a distributed environment. Understand lazy execution and column operations, resist the temptation of toPandas, and migrate existing code incrementally via pandas API on Spark — that's the smoothest way past the limits of a single machine.


This article is based on Spark 3.5. If you need help scaling out pandas-based analytics or migrating data pipelines, feel free to reach out anytime.

— Data Dynamics Engineering Team