pysparksparkjdbcrdbmsparalleldata-engineering

Bulk JDBC Reads and Writes in PySpark — The Single-Connection Bottleneck and Protecting Your Production DB

How to avoid the single-connection bottleneck where only one executor works when reading hundreds of millions of rows from an RDBMS with Spark: parallel reads with partitionColumn, fetchsize and batchsize tuning, and write strategies that won't bring down your production database.

Data DynamicsJune 5, 20266 min read

Reading from and writing to an RDBMS like PostgreSQL, MySQL, or Oracle with Spark is a common task. But if you naively call spark.read.jdbc(...), a single connection keeps only one executor busy while the rest of the cluster sits idle. Reading hundreds of millions of rows on a single thread takes hours. On the flip side, cranking up parallelism blindly can bring your production DB to its knees with a connection storm.

This post covers how to parallelize bulk JDBC reads, the key tuning parameters, and write strategies that protect your production database.

1. The Default-Behavior Trap — A Single Partition

# Dangerous: reading without partition options means one connection = one task
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/app")
    .option("dbtable", "events")
    .option("user", "trino").option("password", "...")
    .load())
# → only 1 executor works, 100M rows on a single thread ☠️

Without partition options, Spark pulls the entire query result into a single partition. It doesn't matter how big your cluster is.

2. Parallel Reads — partitionColumn

To parallelize reads, you have to tell Spark "which column to split on, and into how many pieces." Spark uses this information to issue multiple range-based queries concurrently.

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db:5432/app")
    .option("dbtable", "events")
    .option("user", "trino").option("password", "...")
    # The four options for parallel reads
    .option("partitionColumn", "id")       # split column (numeric/date, indexed)
    .option("lowerBound", "1")             # minimum value
    .option("upperBound", "100000000")     # maximum value
    .option("numPartitions", "32")         # parallelism = concurrent connections
    .load())

Spark translates this into:

Partition 1: WHERE id >= 1          AND id < 3125000
Partition 2: WHERE id >= 3125000    AND id < 6250000
...
Partition 32: WHERE id >= 96875000  AND id <= 100000000
→ 32 connections read each range concurrently

Option	Meaning	Caveat
`partitionColumn`	Column to split on	Numeric/date column with an index
`lowerBound`/`upperBound`	Range boundaries	Must match the data distribution
`numPartitions`	Parallelism (= concurrent connections)	Directly drives DB load

Key point: lowerBound/upperBound are for range splitting, not filtering. Rows outside this range still land in the first/last partition. However, if the boundaries diverge significantly from the actual distribution, partitions become uneven and you get skew.

3. Partition Column Skew — The Most Common Failure

If the values of partitionColumn aren't evenly distributed, some partitions end up empty while others become huge.

If id ranges 1~100M but 99% of rows are actually packed into 1~1M
→ only the first partition is huge, the other 31 finish nearly empty → parallelism is pointless

Mitigations:

Pick a uniformly distributed column (auto-increment ID, evenly spread timestamp).
If the distribution is skewed, align lowerBound/upperBound with the actual distribution, or precompute a hash-based partition column.
If you don't know the distribution, query the DB for quantiles first and set the boundaries from those.

4. fetchsize — How Many Rows Per Round Trip

fetchSize is the number of rows the JDBC driver pulls per network round trip. If the default is small (it varies by driver; some use 10), round trips explode and reads crawl.

.option("fetchsize", "10000")    # fetch 10k rows at a time

fetchsize	Effect
Too small	Network round trips explode → slow
Too large	Memory pressure on executors
Reasonable (1k~10k)	Balanced round trips and memory

PostgreSQL requires fetchsize to be set for cursor-based streaming to kick in (otherwise the entire result set lands in client memory). MySQL may need a separate option. Check your driver's behavior.

5. Protecting the Production DB — The Two Faces of Parallelism

numPartitions is a double-edged sword. Higher means faster, but it puts that many concurrent connections and queries on your production DB.

numPartitions = 64  → 64 concurrent full-scan queries against the production DB
→ resource contention with production transactions, connection pool exhaustion, possible DB outage

Principles for protecting the production DB:

Principle	How
Use a read replica	Connect to a read replica instead of the production primary
Cap parallelism	Keep `numPartitions` at a level the DB can handle
Off-peak windows	Run bulk extracts in nightly batches
Know the connection limit	Leave headroom against the DB's max_connections
Query timeouts	Block runaway queries with statement_timeout on the DB side

Iron rule: point bulk JDBC extracts at a read replica, not the production primary. Analytical full scans must not interfere with production transactions. (The same principle applies in Trino federation — see our separate post "Trino Federation in Practice".)

6. Bulk Writes — batchsize and Idempotency

Spark → RDBMS writes need tuning too.

(df.write
    .format("jdbc")
    .option("url", "...")
    .option("dbtable", "target")
    .option("batchsize", "5000")           # INSERT batch size
    .option("numPartitions", "16")         # concurrent write connections
    .option("isolationLevel", "NONE")      # for bulk loads (use with care)
    .mode("append")
    .save())

Option	Role
`batchsize`	Rows per batched INSERT (fewer round trips)
`numPartitions`	Concurrent write connections
`truncate`	TRUNCATE instead of DROP on overwrite

Write pitfalls:

Idempotency: re-running an append creates duplicates. If reprocessing is possible, load into a staging table and MERGE/UPSERT on the DB side, or design explicit key-conflict handling.
Production load: bulk writes also stress the DB. Tune parallelism and batch size.
Alternative: instead of writing bulk data directly to the RDBMS, it's usually better to write to a Lakehouse (Iceberg/Delta) and feed the DB only small result sets.

7. Choosing a Pattern

Scenario	Recommendation
Bulk read (full extract)	partitionColumn parallelism + replica + off-peak
Incremental read	WHERE clause for changed rows only (subquery in `dbtable`)
Bulk result write	Write to the Lakehouse (avoid writing directly to the DB)
Small result write	Reasonable batchsize + idempotent design
Repeated join analytics	Snapshot the DB into the Lakehouse, then analyze there

# Incremental read: subquery in dbtable to fetch only changed rows
.option("dbtable", "(SELECT * FROM events WHERE updated_at >= '2026-06-01') AS t")

8. Summary

Area	Key Point
Parallel reads	partitionColumn + bounds + numPartitions
Partition column	Uniformly distributed, indexed column
fetchsize	Cut network round trips
DB protection	Read replica, capped parallelism, off-peak
Writes	batchsize, idempotent design, Lakehouse first

Bulk JDBC I/O comes down to two balancing acts — parallelize enough to use your cluster, while staying below the line that takes down your production DB. Turn on parallel reads with partitionColumn and cut round trips with fetchsize, but use a read replica and cap parallelism at what the DB can handle. And for bulk results, loading into a Lakehouse instead of writing directly to the RDBMS is the healthiest architecture for keeping production and analytics apart.

This article is based on Spark 3.5. If you need help with RDBMS-Lakehouse integration or designing bulk extraction pipelines, feel free to reach out.

— Data Dynamics Engineering Team