Apache Iceberg Performance Tuning Guide — Pitfalls and Key Parameters
A practical guide to the pitfalls, anti-patterns, and per-category key table properties (write/read/commit/compaction/manifest) and engine settings (Spark/Trino/Flink) that determine Apache Iceberg performance.
With Iceberg, "how performance changes over time" matters far more than the initial adoption. As metadata accumulates, small files pile up, delete files grow, and snapshots reach the tens of thousands, the same query can become tens of times slower. This guide consolidates the pitfalls and per-category key parameters that drive Iceberg performance from a practical standpoint.
1. The Iceberg Execution Model from a Performance Lens
Two essentials before diving into tuning.
1.1 What Happens on a Single Query
Catalog lookup → location of metadata.json
Read metadata.json → determine current snapshot
Read manifest list → manifest pruning via partition stats
Read manifest files (par.) → data file pruning via column stats
Open data files (par.) → Parquet footer + row-group skipping
Scan row groups → decode only needed row groups
At each stage file count, file size, statistics quality, and sort order affect performance. Push one too hard and another collapses.
1.2 What Happens on a Single Write
Plan write → apply partition/sort policy
Write data files → Parquet encoding + stats collection
Write delete files (MoR) → position/equality deletes
Write manifest files → entries for new files
Write manifest list → combine existing + new manifests
Atomic commit → catalog CAS swaps snapshot pointer
Frequent commits explode manifests and snapshots. Write frequency and file size are practically the biggest perf determinants.
2. Top 10 Pitfalls
In rough order of impact.
2.1 Small File Accumulation
The most common issue in streaming/CDC. Tens of thousands of 1–10MB files push plan time into minutes.
- Symptom:
EXPLAINshows tens of thousands of files; plan time abnormally large relative to scan time - Fix: Periodic
rewrite_data_files, tunewrite.target-file-size-bytes
2.2 Skipping Snapshot Expiration
Accumulated snapshots bloat metadata.json to gigabytes and block GC of data files.
- Symptom:
metadata.jsonis tens of MB or more; opening the table itself is slow - Fix: Daily
expire_snapshots
2.3 Over-Frequent Commits
Sub-minute commits in streaming explode manifest counts.
- Recommended: Flink commit interval ≥ 1 minute, preferably 5–10 minutes
- Fix: Periodic
rewrite_manifests
2.4 Delete File Accumulation (MoR)
Accumulated V2 delete files force an anti-join on every read.
- Symptom: Read time grows linearly over time
- Fix:
rewrite_position_deletes; adopt V3 Deletion Vectors
2.5 Bad Partitioning
- Partitioning by high-cardinality columns (e.g.,
user_id): list cost explodes; manifests bloat - Time partitions too fine (e.g., hourly with
<10krows per bucket): file-count explosion - Missing partition column: full scan
2.6 Missing or Inaccurate Statistics
Without column stats (min/max, null count) — or with wrong ones — manifest pruning is neutered.
- Cause: blanket stats collection across hundreds of columns → some get skipped
- Fix:
write.metadata.metrics.default+ per-columnmetrics.column.<col>
2.7 Hadoop Catalog in Production
File-lock-based; risk of data loss on concurrent commit conflicts. Never use outside dev/test.
2.8 Manifest Bloat
Scanning thousands of manifests per query inflates plan time.
- Fix:
commit.manifest-merge.enabled=true, periodicrewrite_manifests
2.9 No Sort Order
Without sorting on frequently filtered columns, manifest min/max ranges are too wide to enable pruning.
2.10 Vectorized Reader Disabled per Engine
Disabling the Spark/Trino vectorized Parquet reader causes a 2–5× slowdown. It is ON by default but can get accidentally turned off in production.
3. Key Table Properties — Write
Set via TBLPROPERTIES. Focusing on the commonly used ones.
3.1 File Size and Format
| Property | Default | Description |
|---|---|---|
write.format.default | parquet | Data file format (parquet/orc/avro) |
write.target-file-size-bytes | 536870912 (512MB) | Target file size. Analytics: 256MB–1GB |
write.parquet.row-group-size-bytes | 134217728 (128MB) | Parquet row-group size; too large reduces skip benefit |
write.parquet.page-size-bytes | 1048576 (1MB) | Parquet page size |
write.parquet.dict-size-bytes | 2097152 (2MB) | Max dictionary encoding size |
write.parquet.compression-codec | zstd | zstd/snappy/gzip/lz4 — analytics: zstd |
write.parquet.compression-level | (codec-specific) | zstd 1–9, typically 3 |
write.parquet.bloom-filter-enabled.column.<col> | false | For high-cardinality equality filters |
write.metadata.compression-codec | none | Metadata compression; gzip for large tables |
write.object-storage.enabled | false | S3 key-prefix randomization to avoid throttling |
ALTER TABLE db.orders SET TBLPROPERTIES (
'write.target-file-size-bytes' = '536870912',
'write.parquet.compression-codec' = 'zstd',
'write.parquet.compression-level' = '3',
'write.metadata.compression-codec' = 'gzip',
'write.object-storage.enabled' = 'true'
);3.2 Distribution and Sorting
| Property | Default | Description |
|---|---|---|
write.distribution-mode | hash (Spark v3+) | none/hash/range. Trade-off between file count and skew |
write.delete.distribution-mode | (same as write) | Delete file distribution |
write.update.distribution-mode | (same as write) | Update distribution |
write.spark.fanout.enabled | false | Per-partition fanout writer; enable when partitions are very many |
hash groups data per partition, reducing small files but vulnerable to skew. range pairs with sorting for balanced distribution.
3.3 Row-Level Operation Modes
| Property | Default (V2) | Description |
|---|---|---|
write.delete.mode | copy-on-write | copy-on-write or merge-on-read |
write.update.mode | copy-on-write | Same |
write.merge.mode | copy-on-write | Same |
write.delete.target-file-size-bytes | 67108864 (64MB) | Delete file size |
write.delete.parquet.compression-codec | zstd | Delete file compression |
Use MoR for CDC/frequent changes, CoW for batch ETL. Mismatched modes lead to surprising costs.
3.4 Metadata Cleanup
| Property | Default | Description |
|---|---|---|
write.metadata.delete-after-commit.enabled | false | Auto-delete old metadata.json after commit |
write.metadata.previous-versions-max | 100 | Retained metadata.json versions |
Production typically uses delete-after-commit=true with previous-versions-max=20.
3.5 Statistics Collection
| Property | Default | Description |
|---|---|---|
write.metadata.metrics.default | truncate(16) | Default stats mode for all columns |
write.metadata.metrics.column.<col> | — | Per-column override (none/counts/truncate(N)/full) |
write.metadata.metrics.max-inferred-column-defaults | 100 | Max columns to collect default stats |
For wide tables with hundreds of columns, stats bloat metadata. Use full for the 5–10 frequently filtered columns and counts or none for the rest.
ALTER TABLE db.orders SET TBLPROPERTIES (
'write.metadata.metrics.default' = 'counts',
'write.metadata.metrics.column.order_id' = 'full',
'write.metadata.metrics.column.ts' = 'full',
'write.metadata.metrics.column.customer_id' = 'full'
);4. Key Table Properties — Read
| Property | Default | Description |
|---|---|---|
read.split.target-size | 134217728 (128MB) | Bytes processed per task |
read.split.metadata-target-size | 33554432 (32MB) | Metadata table split size |
read.split.planning-lookback | 10 | Split packing lookback |
read.split.open-file-cost | 4MB | Estimated file-open cost (small-file combine threshold) |
read.parquet.vectorization.enabled | true | Parquet vectorized reader |
read.parquet.vectorization.batch-size | 5000 | Vectorized batch row count |
ALTER TABLE db.orders SET TBLPROPERTIES (
'read.split.target-size' = '268435456',
'read.split.open-file-cost' = '8388608'
);open-file-cost is critical. Raising it when many small files exist increases file-combine per task and improves plan efficiency.
5. Commit / Manifest
5.1 Commit Retry
| Property | Default | Description |
|---|---|---|
commit.retry.num-retries | 4 | Retries on OCC conflicts |
commit.retry.min-wait-ms | 100 | Min wait |
commit.retry.max-wait-ms | 60000 | Max wait |
commit.retry.total-timeout-ms | 1800000 (30m) | Total retry timeout |
commit.status-check.num-retries | 3 | Commit status check retries |
When streaming + batch write the same table concurrently, raising num-retries to 8–10 is safer.
5.2 Manifest Merging
| Property | Default | Description |
|---|---|---|
commit.manifest.target-size-bytes | 8388608 (8MB) | Target manifest size |
commit.manifest.min-count-to-merge | 100 | Merge once this many exist |
commit.manifest-merge.enabled | true | Auto-merge manifests on commit |
For tables with frequent streaming commits, lower min-count-to-merge to 50–80 to prevent manifest explosion proactively.
5.3 Snapshot Retention
| Property | Default | Description |
|---|---|---|
history.expire.max-snapshot-age-ms | 432000000 (5d) | Max snapshot retention |
history.expire.min-snapshots-to-keep | 1 | Min retained snapshots |
history.expire.max-ref-age-ms | (no limit) | Max branch/tag retention |
Tune per table — e.g., 30 days for master tables, 7 days for time-series facts.
6. Compaction Action Parameters
Options passed to rewrite_data_files.
| Option | Default | Description |
|---|---|---|
target-file-size-bytes | (table property) | Target file size |
min-input-files | 5 | Min files for a group to be processed |
min-file-size-bytes | (target × 0.75) | Candidate if smaller |
max-file-size-bytes | (target × 1.8) | Candidate if larger |
max-file-group-size-bytes | 107374182400 (100GB) | Max size per group |
max-concurrent-file-group-rewrites | 5 | Concurrent group rewrites |
partial-progress.enabled | false | Allow partial commits per group |
partial-progress.max-commits | 10 | Max partial commits |
rewrite-job-order | none | bytes-asc/bytes-desc/files-asc/files-desc |
CALL demo.system.rewrite_data_files(
table => 'db.orders',
strategy => 'sort',
sort_order => 'ts ASC NULLS LAST, customer_id ASC',
options => map(
'target-file-size-bytes', '536870912',
'max-concurrent-file-group-rewrites', '10',
'partial-progress.enabled', 'true',
'partial-progress.max-commits', '50',
'rewrite-job-order', 'bytes-asc'
)
);For large-table compaction, always set partial-progress.enabled=true — otherwise one failure can blow away hours of work.
6.1 Z-Order Compaction
CALL demo.system.rewrite_data_files(
table => 'db.orders',
strategy => 'sort',
sort_order => 'zorder(customer_id, ts)',
options => map('target-file-size-bytes', '536870912')
);Z-Order shines when two or more columns are combined in filters.
6.2 Delete File Compaction
CALL demo.system.rewrite_position_deletes(
table => 'db.orders',
options => map(
'rewrite-job-order', 'bytes-asc',
'partial-progress.enabled', 'true'
)
);For MoR production tables, delete compaction can matter more than data compaction.
6.3 Manifest Rewrite
CALL demo.system.rewrite_manifests(
table => 'db.orders',
use_caching => true
);If plan time has grown unusually long, this is the first thing to check.
7. Engine Settings — Spark
7.1 SparkSession Level
# Iceberg vectorized reader
spark.sql.iceberg.vectorization.enabled = true
# Timestamp handling (timezone-aware)
spark.sql.iceberg.handle-timestamp-without-timezone = false
# Catalog cache
spark.sql.catalog.demo.cache-enabled = true
spark.sql.catalog.demo.cache.expiration-interval-ms = 30000
# Adaptive Query Execution
spark.sql.adaptive.enabled = true
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
# Shuffle partitions (pairs with write distribution=hash)
spark.sql.shuffle.partitions = 8007.2 Write Options (DataFrame writer)
(df.writeTo("demo.db.orders")
.option("write-format", "parquet")
.option("target-file-size-bytes", "536870912")
.option("check-ordering", "false") # skip ordering check
.option("fanout-enabled", "true") # when many partitions
.append())7.3 Read Options
(spark.read.format("iceberg")
.option("split-size", "268435456")
.option("lookback", "20")
.option("file-open-cost", "8388608")
.option("vectorization-enabled", "true")
.load("demo.db.orders"))8. Engine Settings — Trino
Set in iceberg.properties or per catalog.
connector.name = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181
# Performance
iceberg.split-size = 128MB
iceberg.max-splits-per-second = 100
iceberg.use-file-size-from-metadata = true
# Manifest cache (critical)
iceberg.metadata-cache.enabled = true
iceberg.metadata-cache.max-size = 1000
iceberg.metadata-cache.ttl = 1h
iceberg.io.manifest.cache-enabled = true
iceberg.io.manifest.cache.max-total-bytes = 104857600
iceberg.io.manifest.cache.expiration-interval-ms = 60000
# Parquet
parquet.use-column-index = true
parquet.use-bloom-filter = true
parquet.max-read-block-size = 16MB
# Stats-driven CBO
iceberg.table-statistics-enabled = true
iceberg.extended-statistics.enabled = trueThe single biggest win on Trino is iceberg.io.manifest.cache-enabled — plan time on large tables can drop by more than half.
9. Engine Settings — Flink
CREATE TABLE orders_iceberg (
id BIGINT,
ts TIMESTAMP(3),
amount DECIMAL(10,2)
) WITH (
'connector' = 'iceberg',
'catalog-type' = 'rest',
'uri' = 'http://iceberg-rest:8181',
'warehouse' = 's3://my-warehouse',
-- Commit frequency (the most important knob)
'write-parallelism' = '8',
'upsert-enabled' = 'true',
'equality-field-columns' = 'id',
-- File size
'write.target-file-size-bytes' = '134217728', -- 128MB (smaller for streaming)
-- Checkpoint-based commit
'write.distribution-mode' = 'hash'
);In Flink, commits happen per checkpoint, so the checkpoint interval is effectively the commit interval.
# flink-conf.yaml
execution.checkpointing.interval = 5min
execution.checkpointing.min-pause = 1minToo frequent checkpoints explode manifests and snapshots; too rare hurt data freshness. 5–10 minutes is typical.
10. Per-Workload Recommended Profiles
10.1 Analytics-Heavy (Read-heavy, batch ingest)
ALTER TABLE db.orders SET TBLPROPERTIES (
'write.target-file-size-bytes' = '1073741824', -- 1GB
'write.parquet.compression-codec' = 'zstd',
'write.parquet.compression-level' = '6',
'write.parquet.row-group-size-bytes' = '268435456',
'write.delete.mode' = 'copy-on-write',
'write.distribution-mode' = 'range',
'read.split.target-size' = '536870912'
);10.2 Streaming Ingest + Batch Analytics
ALTER TABLE db.events SET TBLPROPERTIES (
'write.target-file-size-bytes' = '134217728', -- 128MB
'write.parquet.compression-codec' = 'zstd',
'write.parquet.compression-level' = '3',
'write.delete.mode' = 'merge-on-read',
'write.distribution-mode' = 'hash',
'commit.manifest.min-count-to-merge' = '50',
'write.metadata.delete-after-commit.enabled' = 'true',
'write.metadata.previous-versions-max' = '20'
);10.3 CDC (Frequent Upserts)
ALTER TABLE db.cdc_users SET TBLPROPERTIES (
'write.target-file-size-bytes' = '268435456',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read',
'write.delete.target-file-size-bytes' = '67108864',
'format-version' = '2',
'commit.manifest-merge.enabled' = 'true'
);If V3 is available, Deletion Vectors push read performance further.
11. Monitoring — What to Watch
Track these daily to catch issues early.
| Metric | Threshold (example) | Action |
|---|---|---|
| Files per table | > 100k | Compact |
| Average file size | < 64MB | Compact (target-file-size-bytes) |
| Manifest count | > 1000 | rewrite_manifests |
| Snapshot count | > 1000 | expire_snapshots |
metadata.json size | > 50MB | Compress metadata + expire snapshots |
| Delete file ratio | > 10% of data | rewrite_position_deletes |
| Plan time | 3× normal | Inspect manifest/cache |
| Commit retries | Avg > 1 | Audit concurrent writers |
You can query Iceberg's metadata tables directly.
-- File count, average size
SELECT count(*), avg(file_size_in_bytes)
FROM iceberg.db."orders$files";
-- Snapshot count
SELECT count(*) FROM iceberg.db."orders$snapshots";
-- Per-partition file stats
SELECT partition, file_count, total_size, position_delete_record_count
FROM iceberg.db."orders$partitions"
ORDER BY file_count DESC LIMIT 20;12. Checklist
Run through this whenever you create a new table.
- Is
write.target-file-size-bytesaligned with the workload? - Is
write.distribution-modewhat you intended (none/hash/range)? - Do
write.delete.mode/write.update.mode/write.merge.modematch each other? - Are
write.metadata.delete-after-commit.enabled = trueandprevious-versions-maxset? - Stats (
full) configured for frequently filtered columns? -
write.object-storage.enabledfor large tables to avoid S3 throttling? - Sort Order or Z-Order defined?
- Compaction, snapshot expire, orphan cleanup scheduled?
- Monitoring dashboard registered?
- (Trino)
iceberg.io.manifest.cache-enabled = true? - (Flink) Checkpoint interval ≥ 5 minutes?
13. Wrap-Up
The essence of Iceberg performance tuning is keeping three things in balance: file count, metadata size, and statistics quality. Hardly any problem yields to a single knob — you have to design write/read/commit/compaction together to fit the workload.
The most common problem is always the same — streaming commits too often, and compaction/expire jobs aren't running. Before reaching for new parameters, audit your operational automation first.
Related posts: