Bulk JDBC Reads and Writes in PySpark — The Single-Connection Bottleneck and Protecting Your Production DB
How to avoid the single-connection bottleneck where only one executor works when reading hundreds of millions of rows from an RDBMS with Spark: parallel reads with partitionColumn, fetchsize and batchsize tuning, and write strategies that won't bring down your production database.
Reading from and writing to an RDBMS like PostgreSQL, MySQL, or Oracle with Spark is a common task. But if you naively call spark.read.jdbc(...), a single connection keeps only one executor busy while the rest of the cluster sits idle. Reading hundreds of millions of rows on a single thread takes hours. On the flip side, cranking up parallelism blindly can bring your production DB to its knees with a connection storm.
This post covers how to parallelize bulk JDBC reads, the key tuning parameters, and write strategies that protect your production database.
1. The Default-Behavior Trap — A Single Partition
# Dangerous: reading without partition options means one connection = one task
df = (spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://db:5432/app")
.option("dbtable", "events")
.option("user", "trino").option("password", "...")
.load())
# → only 1 executor works, 100M rows on a single thread ☠️Without partition options, Spark pulls the entire query result into a single partition. It doesn't matter how big your cluster is.
2. Parallel Reads — partitionColumn
To parallelize reads, you have to tell Spark "which column to split on, and into how many pieces." Spark uses this information to issue multiple range-based queries concurrently.
df = (spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://db:5432/app")
.option("dbtable", "events")
.option("user", "trino").option("password", "...")
# The four options for parallel reads
.option("partitionColumn", "id") # split column (numeric/date, indexed)
.option("lowerBound", "1") # minimum value
.option("upperBound", "100000000") # maximum value
.option("numPartitions", "32") # parallelism = concurrent connections
.load())Spark translates this into:
Partition 1: WHERE id >= 1 AND id < 3125000
Partition 2: WHERE id >= 3125000 AND id < 6250000
...
Partition 32: WHERE id >= 96875000 AND id <= 100000000
→ 32 connections read each range concurrently| Option | Meaning | Caveat |
|---|---|---|
partitionColumn | Column to split on | Numeric/date column with an index |
lowerBound/upperBound | Range boundaries | Must match the data distribution |
numPartitions | Parallelism (= concurrent connections) | Directly drives DB load |
Key point:
lowerBound/upperBoundare for range splitting, not filtering. Rows outside this range still land in the first/last partition. However, if the boundaries diverge significantly from the actual distribution, partitions become uneven and you get skew.
3. Partition Column Skew — The Most Common Failure
If the values of partitionColumn aren't evenly distributed, some partitions end up empty while others become huge.
If id ranges 1~100M but 99% of rows are actually packed into 1~1M
→ only the first partition is huge, the other 31 finish nearly empty → parallelism is pointlessMitigations:
- Pick a uniformly distributed column (auto-increment ID, evenly spread timestamp).
- If the distribution is skewed, align
lowerBound/upperBoundwith the actual distribution, or precompute a hash-based partition column. - If you don't know the distribution, query the DB for quantiles first and set the boundaries from those.
4. fetchsize — How Many Rows Per Round Trip
fetchSize is the number of rows the JDBC driver pulls per network round trip. If the default is small (it varies by driver; some use 10), round trips explode and reads crawl.
.option("fetchsize", "10000") # fetch 10k rows at a time| fetchsize | Effect |
|---|---|
| Too small | Network round trips explode → slow |
| Too large | Memory pressure on executors |
| Reasonable (1k~10k) | Balanced round trips and memory |
PostgreSQL requires fetchsize to be set for cursor-based streaming to kick in (otherwise the entire result set lands in client memory). MySQL may need a separate option. Check your driver's behavior.
5. Protecting the Production DB — The Two Faces of Parallelism
numPartitions is a double-edged sword. Higher means faster, but it puts that many concurrent connections and queries on your production DB.
numPartitions = 64 → 64 concurrent full-scan queries against the production DB
→ resource contention with production transactions, connection pool exhaustion, possible DB outagePrinciples for protecting the production DB:
| Principle | How |
|---|---|
| Use a read replica | Connect to a read replica instead of the production primary |
| Cap parallelism | Keep numPartitions at a level the DB can handle |
| Off-peak windows | Run bulk extracts in nightly batches |
| Know the connection limit | Leave headroom against the DB's max_connections |
| Query timeouts | Block runaway queries with statement_timeout on the DB side |
Iron rule: point bulk JDBC extracts at a read replica, not the production primary. Analytical full scans must not interfere with production transactions. (The same principle applies in Trino federation — see our separate post "Trino Federation in Practice".)
6. Bulk Writes — batchsize and Idempotency
Spark → RDBMS writes need tuning too.
(df.write
.format("jdbc")
.option("url", "...")
.option("dbtable", "target")
.option("batchsize", "5000") # INSERT batch size
.option("numPartitions", "16") # concurrent write connections
.option("isolationLevel", "NONE") # for bulk loads (use with care)
.mode("append")
.save())| Option | Role |
|---|---|
batchsize | Rows per batched INSERT (fewer round trips) |
numPartitions | Concurrent write connections |
truncate | TRUNCATE instead of DROP on overwrite |
Write pitfalls:
- Idempotency: re-running an
appendcreates duplicates. If reprocessing is possible, load into a staging table and MERGE/UPSERT on the DB side, or design explicit key-conflict handling. - Production load: bulk writes also stress the DB. Tune parallelism and batch size.
- Alternative: instead of writing bulk data directly to the RDBMS, it's usually better to write to a Lakehouse (Iceberg/Delta) and feed the DB only small result sets.
7. Choosing a Pattern
| Scenario | Recommendation |
|---|---|
| Bulk read (full extract) | partitionColumn parallelism + replica + off-peak |
| Incremental read | WHERE clause for changed rows only (subquery in dbtable) |
| Bulk result write | Write to the Lakehouse (avoid writing directly to the DB) |
| Small result write | Reasonable batchsize + idempotent design |
| Repeated join analytics | Snapshot the DB into the Lakehouse, then analyze there |
# Incremental read: subquery in dbtable to fetch only changed rows
.option("dbtable", "(SELECT * FROM events WHERE updated_at >= '2026-06-01') AS t")8. Summary
| Area | Key Point |
|---|---|
| Parallel reads | partitionColumn + bounds + numPartitions |
| Partition column | Uniformly distributed, indexed column |
| fetchsize | Cut network round trips |
| DB protection | Read replica, capped parallelism, off-peak |
| Writes | batchsize, idempotent design, Lakehouse first |
Bulk JDBC I/O comes down to two balancing acts — parallelize enough to use your cluster, while staying below the line that takes down your production DB. Turn on parallel reads with partitionColumn and cut round trips with fetchsize, but use a read replica and cap parallelism at what the DB can handle. And for bulk results, loading into a Lakehouse instead of writing directly to the RDBMS is the healthiest architecture for keeping production and analytics apart.
This article is based on Spark 3.5. If you need help with RDBMS-Lakehouse integration or designing bulk extraction pipelines, feel free to reach out.
— Data Dynamics Engineering Team