icebergperformancetuninglakehousesparktrino

Apache Iceberg Performance Tuning Guide — Pitfalls and Key Parameters

A practical guide to the pitfalls, anti-patterns, and per-category key table properties (write/read/commit/compaction/manifest) and engine settings (Spark/Trino/Flink) that determine Apache Iceberg performance.

Data DynamicsMay 23, 202612 min read

With Iceberg, "how performance changes over time" matters far more than the initial adoption. As metadata accumulates, small files pile up, delete files grow, and snapshots reach the tens of thousands, the same query can become tens of times slower. This guide consolidates the pitfalls and per-category key parameters that drive Iceberg performance from a practical standpoint.

1. The Iceberg Execution Model from a Performance Lens

Two essentials before diving into tuning.

1.1 What Happens on a Single Query

Loading diagram…

At each stage file count, file size, statistics quality, and sort order affect performance. Push one too hard and another collapses.

1.2 What Happens on a Single Write

Loading diagram…

Frequent commits explode manifests and snapshots. Write frequency and file size are practically the biggest perf determinants.

2. Top 10 Pitfalls

In rough order of impact.

2.1 Small File Accumulation

The most common issue in streaming/CDC. Tens of thousands of 1–10MB files push plan time into minutes.

Symptom: EXPLAIN shows tens of thousands of files; plan time abnormally large relative to scan time
Fix: Periodic rewrite_data_files, tune write.target-file-size-bytes

2.2 Skipping Snapshot Expiration

Accumulated snapshots bloat metadata.json to gigabytes and block GC of data files.

Symptom: metadata.json is tens of MB or more; opening the table itself is slow
Fix: Daily expire_snapshots

2.3 Over-Frequent Commits

Sub-minute commits in streaming explode manifest counts.

Recommended: Flink commit interval ≥ 1 minute, preferably 5–10 minutes
Fix: Periodic rewrite_manifests

2.4 Delete File Accumulation (MoR)

Accumulated V2 delete files force an anti-join on every read.

Symptom: Read time grows linearly over time
Fix: rewrite_position_deletes; adopt V3 Deletion Vectors

2.5 Bad Partitioning

Partitioning by high-cardinality columns (e.g., user_id): list cost explodes; manifests bloat
Time partitions too fine (e.g., hourly with <10k rows per bucket): file-count explosion
Missing partition column: full scan

2.6 Missing or Inaccurate Statistics

Without column stats (min/max, null count) — or with wrong ones — manifest pruning is neutered.

Cause: blanket stats collection across hundreds of columns → some get skipped
Fix: write.metadata.metrics.default + per-column metrics.column.<col>

2.7 Hadoop Catalog in Production

File-lock-based; risk of data loss on concurrent commit conflicts. Never use outside dev/test.

2.8 Manifest Bloat

Scanning thousands of manifests per query inflates plan time.

Fix: commit.manifest-merge.enabled=true, periodic rewrite_manifests

2.9 No Sort Order

Without sorting on frequently filtered columns, manifest min/max ranges are too wide to enable pruning.

2.10 Vectorized Reader Disabled per Engine

Disabling the Spark/Trino vectorized Parquet reader causes a 2–5× slowdown. It is ON by default but can get accidentally turned off in production.

3. Key Table Properties — Write

Set via TBLPROPERTIES. Focusing on the commonly used ones.

3.1 File Size and Format

Property	Default	Description
`write.format.default`	`parquet`	Data file format (`parquet`/`orc`/`avro`)
`write.target-file-size-bytes`	`536870912` (512MB)	Target file size. Analytics: 256MB–1GB
`write.parquet.row-group-size-bytes`	`134217728` (128MB)	Parquet row-group size; too large reduces skip benefit
`write.parquet.page-size-bytes`	`1048576` (1MB)	Parquet page size
`write.parquet.dict-size-bytes`	`2097152` (2MB)	Max dictionary encoding size
`write.parquet.compression-codec`	`zstd`	`zstd`/`snappy`/`gzip`/`lz4` — analytics: `zstd`
`write.parquet.compression-level`	(codec-specific)	zstd 1–9, typically 3
`write.parquet.bloom-filter-enabled.column.<col>`	`false`	For high-cardinality equality filters
`write.metadata.compression-codec`	`none`	Metadata compression; `gzip` for large tables
`write.object-storage.enabled`	`false`	S3 key-prefix randomization to avoid throttling

ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '536870912',
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3',
  'write.metadata.compression-codec' = 'gzip',
  'write.object-storage.enabled' = 'true'
);

3.2 Distribution and Sorting

Property	Default	Description
`write.distribution-mode`	`hash` (Spark v3+)	`none`/`hash`/`range`. Trade-off between file count and skew
`write.delete.distribution-mode`	(same as write)	Delete file distribution
`write.update.distribution-mode`	(same as write)	Update distribution
`write.spark.fanout.enabled`	`false`	Per-partition fanout writer; enable when partitions are very many

hash groups data per partition, reducing small files but vulnerable to skew. range pairs with sorting for balanced distribution.

3.3 Row-Level Operation Modes

Property	Default (V2)	Description
`write.delete.mode`	`copy-on-write`	`copy-on-write` or `merge-on-read`
`write.update.mode`	`copy-on-write`	Same
`write.merge.mode`	`copy-on-write`	Same
`write.delete.target-file-size-bytes`	`67108864` (64MB)	Delete file size
`write.delete.parquet.compression-codec`	`zstd`	Delete file compression

Use MoR for CDC/frequent changes, CoW for batch ETL. Mismatched modes lead to surprising costs.

3.4 Metadata Cleanup

Property	Default	Description
`write.metadata.delete-after-commit.enabled`	`false`	Auto-delete old `metadata.json` after commit
`write.metadata.previous-versions-max`	`100`	Retained `metadata.json` versions

Production typically uses delete-after-commit=true with previous-versions-max=20.

3.5 Statistics Collection

Property	Default	Description
`write.metadata.metrics.default`	`truncate(16)`	Default stats mode for all columns
`write.metadata.metrics.column.<col>`	—	Per-column override (`none`/`counts`/`truncate(N)`/`full`)
`write.metadata.metrics.max-inferred-column-defaults`	`100`	Max columns to collect default stats

For wide tables with hundreds of columns, stats bloat metadata. Use full for the 5–10 frequently filtered columns and counts or none for the rest.

ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.metadata.metrics.default' = 'counts',
  'write.metadata.metrics.column.order_id' = 'full',
  'write.metadata.metrics.column.ts' = 'full',
  'write.metadata.metrics.column.customer_id' = 'full'
);

4. Key Table Properties — Read

Property	Default	Description
`read.split.target-size`	`134217728` (128MB)	Bytes processed per task
`read.split.metadata-target-size`	`33554432` (32MB)	Metadata table split size
`read.split.planning-lookback`	`10`	Split packing lookback
`read.split.open-file-cost`	`4MB`	Estimated file-open cost (small-file combine threshold)
`read.parquet.vectorization.enabled`	`true`	Parquet vectorized reader
`read.parquet.vectorization.batch-size`	`5000`	Vectorized batch row count

ALTER TABLE db.orders SET TBLPROPERTIES (
  'read.split.target-size' = '268435456',
  'read.split.open-file-cost' = '8388608'
);

open-file-cost is critical. Raising it when many small files exist increases file-combine per task and improves plan efficiency.

5. Commit / Manifest

5.1 Commit Retry

Property	Default	Description
`commit.retry.num-retries`	`4`	Retries on OCC conflicts
`commit.retry.min-wait-ms`	`100`	Min wait
`commit.retry.max-wait-ms`	`60000`	Max wait
`commit.retry.total-timeout-ms`	`1800000` (30m)	Total retry timeout
`commit.status-check.num-retries`	`3`	Commit status check retries

When streaming + batch write the same table concurrently, raising num-retries to 8–10 is safer.

5.2 Manifest Merging

Property	Default	Description
`commit.manifest.target-size-bytes`	`8388608` (8MB)	Target manifest size
`commit.manifest.min-count-to-merge`	`100`	Merge once this many exist
`commit.manifest-merge.enabled`	`true`	Auto-merge manifests on commit

For tables with frequent streaming commits, lower min-count-to-merge to 50–80 to prevent manifest explosion proactively.

5.3 Snapshot Retention

Property	Default	Description
`history.expire.max-snapshot-age-ms`	`432000000` (5d)	Max snapshot retention
`history.expire.min-snapshots-to-keep`	`1`	Min retained snapshots
`history.expire.max-ref-age-ms`	(no limit)	Max branch/tag retention

Tune per table — e.g., 30 days for master tables, 7 days for time-series facts.

6. Compaction Action Parameters

Options passed to rewrite_data_files.

Option	Default	Description
`target-file-size-bytes`	(table property)	Target file size
`min-input-files`	`5`	Min files for a group to be processed
`min-file-size-bytes`	(target × 0.75)	Candidate if smaller
`max-file-size-bytes`	(target × 1.8)	Candidate if larger
`max-file-group-size-bytes`	`107374182400` (100GB)	Max size per group
`max-concurrent-file-group-rewrites`	`5`	Concurrent group rewrites
`partial-progress.enabled`	`false`	Allow partial commits per group
`partial-progress.max-commits`	`10`	Max partial commits
`rewrite-job-order`	`none`	`bytes-asc`/`bytes-desc`/`files-asc`/`files-desc`

CALL demo.system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'ts ASC NULLS LAST, customer_id ASC',
  options => map(
    'target-file-size-bytes', '536870912',
    'max-concurrent-file-group-rewrites', '10',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '50',
    'rewrite-job-order', 'bytes-asc'
  )
);

For large-table compaction, always set partial-progress.enabled=true — otherwise one failure can blow away hours of work.

6.1 Z-Order Compaction

CALL demo.system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'zorder(customer_id, ts)',
  options => map('target-file-size-bytes', '536870912')
);

Z-Order shines when two or more columns are combined in filters.

6.2 Delete File Compaction

CALL demo.system.rewrite_position_deletes(
  table => 'db.orders',
  options => map(
    'rewrite-job-order', 'bytes-asc',
    'partial-progress.enabled', 'true'
  )
);

For MoR production tables, delete compaction can matter more than data compaction.

6.3 Manifest Rewrite

CALL demo.system.rewrite_manifests(
  table => 'db.orders',
  use_caching => true
);

If plan time has grown unusually long, this is the first thing to check.

7. Engine Settings — Spark

7.1 SparkSession Level

# Iceberg vectorized reader
spark.sql.iceberg.vectorization.enabled = true
 
# Timestamp handling (timezone-aware)
spark.sql.iceberg.handle-timestamp-without-timezone = false
 
# Catalog cache
spark.sql.catalog.demo.cache-enabled       = true
spark.sql.catalog.demo.cache.expiration-interval-ms = 30000
 
# Adaptive Query Execution
spark.sql.adaptive.enabled                       = true
spark.sql.adaptive.coalescePartitions.enabled    = true
spark.sql.adaptive.skewJoin.enabled              = true
 
# Shuffle partitions (pairs with write distribution=hash)
spark.sql.shuffle.partitions = 800

7.2 Write Options (DataFrame writer)

(df.writeTo("demo.db.orders")
    .option("write-format", "parquet")
    .option("target-file-size-bytes", "536870912")
    .option("check-ordering", "false")     # skip ordering check
    .option("fanout-enabled", "true")      # when many partitions
    .append())

7.3 Read Options

(spark.read.format("iceberg")
    .option("split-size", "268435456")
    .option("lookback", "20")
    .option("file-open-cost", "8388608")
    .option("vectorization-enabled", "true")
    .load("demo.db.orders"))

8. Engine Settings — Trino

Set in iceberg.properties or per catalog.

connector.name = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181
 
# Performance
iceberg.split-size                          = 128MB
iceberg.max-splits-per-second               = 100
iceberg.use-file-size-from-metadata         = true
 
# Manifest cache (critical)
iceberg.metadata-cache.enabled              = true
iceberg.metadata-cache.max-size             = 1000
iceberg.metadata-cache.ttl                  = 1h
iceberg.io.manifest.cache-enabled           = true
iceberg.io.manifest.cache.max-total-bytes   = 104857600
iceberg.io.manifest.cache.expiration-interval-ms = 60000
 
# Parquet
parquet.use-column-index                    = true
parquet.use-bloom-filter                    = true
parquet.max-read-block-size                 = 16MB
 
# Stats-driven CBO
iceberg.table-statistics-enabled            = true
iceberg.extended-statistics.enabled         = true

The single biggest win on Trino is iceberg.io.manifest.cache-enabled — plan time on large tables can drop by more than half.

9. Engine Settings — Flink

CREATE TABLE orders_iceberg (
  id BIGINT,
  ts TIMESTAMP(3),
  amount DECIMAL(10,2)
) WITH (
  'connector' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://iceberg-rest:8181',
  'warehouse' = 's3://my-warehouse',
 
  -- Commit frequency (the most important knob)
  'write-parallelism' = '8',
  'upsert-enabled' = 'true',
  'equality-field-columns' = 'id',
 
  -- File size
  'write.target-file-size-bytes' = '134217728',   -- 128MB (smaller for streaming)
 
  -- Checkpoint-based commit
  'write.distribution-mode' = 'hash'
);

In Flink, commits happen per checkpoint, so the checkpoint interval is effectively the commit interval.

# flink-conf.yaml
execution.checkpointing.interval = 5min
execution.checkpointing.min-pause = 1min

Too frequent checkpoints explode manifests and snapshots; too rare hurt data freshness. 5–10 minutes is typical.

10. Per-Workload Recommended Profiles

10.1 Analytics-Heavy (Read-heavy, batch ingest)

ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '1073741824',   -- 1GB
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '6',
  'write.parquet.row-group-size-bytes' = '268435456',
  'write.delete.mode' = 'copy-on-write',
  'write.distribution-mode' = 'range',
  'read.split.target-size' = '536870912'
);

10.2 Streaming Ingest + Batch Analytics

ALTER TABLE db.events SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',    -- 128MB
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3',
  'write.delete.mode' = 'merge-on-read',
  'write.distribution-mode' = 'hash',
  'commit.manifest.min-count-to-merge' = '50',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '20'
);

10.3 CDC (Frequent Upserts)

ALTER TABLE db.cdc_users SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '268435456',
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read',
  'write.delete.target-file-size-bytes' = '67108864',
  'format-version' = '2',
  'commit.manifest-merge.enabled' = 'true'
);

If V3 is available, Deletion Vectors push read performance further.

11. Monitoring — What to Watch

Track these daily to catch issues early.

Metric	Threshold (example)	Action
Files per table	> 100k	Compact
Average file size	< 64MB	Compact (`target-file-size-bytes`)
Manifest count	> 1000	`rewrite_manifests`
Snapshot count	> 1000	`expire_snapshots`
`metadata.json` size	> 50MB	Compress metadata + expire snapshots
Delete file ratio	> 10% of data	`rewrite_position_deletes`
Plan time	3× normal	Inspect manifest/cache
Commit retries	Avg > 1	Audit concurrent writers

You can query Iceberg's metadata tables directly.

-- File count, average size
SELECT count(*), avg(file_size_in_bytes)
FROM iceberg.db."orders$files";
 
-- Snapshot count
SELECT count(*) FROM iceberg.db."orders$snapshots";
 
-- Per-partition file stats
SELECT partition, file_count, total_size, position_delete_record_count
FROM iceberg.db."orders$partitions"
ORDER BY file_count DESC LIMIT 20;

12. Checklist

Run through this whenever you create a new table.

13. Wrap-Up

The essence of Iceberg performance tuning is keeping three things in balance: file count, metadata size, and statistics quality. Hardly any problem yields to a single knob — you have to design write/read/commit/compaction together to fit the workload.

The most common problem is always the same — streaming commits too often, and compaction/expire jobs aren't running. Before reaching for new parameters, audit your operational automation first.