Blog
icebergperformancetuninglakehousesparktrino

Apache Iceberg Performance Tuning Guide — Pitfalls and Key Parameters

A practical guide to the pitfalls, anti-patterns, and per-category key table properties (write/read/commit/compaction/manifest) and engine settings (Spark/Trino/Flink) that determine Apache Iceberg performance.

Data DynamicsMay 23, 202612 min read

With Iceberg, "how performance changes over time" matters far more than the initial adoption. As metadata accumulates, small files pile up, delete files grow, and snapshots reach the tens of thousands, the same query can become tens of times slower. This guide consolidates the pitfalls and per-category key parameters that drive Iceberg performance from a practical standpoint.


1. The Iceberg Execution Model from a Performance Lens

Two essentials before diving into tuning.

1.1 What Happens on a Single Query

Catalog lookup              → location of metadata.json
Read metadata.json          → determine current snapshot
Read manifest list          → manifest pruning via partition stats
Read manifest files (par.)  → data file pruning via column stats
Open data files (par.)      → Parquet footer + row-group skipping
Scan row groups             → decode only needed row groups

At each stage file count, file size, statistics quality, and sort order affect performance. Push one too hard and another collapses.

1.2 What Happens on a Single Write

Plan write                  → apply partition/sort policy
Write data files            → Parquet encoding + stats collection
Write delete files (MoR)    → position/equality deletes
Write manifest files        → entries for new files
Write manifest list         → combine existing + new manifests
Atomic commit               → catalog CAS swaps snapshot pointer

Frequent commits explode manifests and snapshots. Write frequency and file size are practically the biggest perf determinants.


2. Top 10 Pitfalls

In rough order of impact.

2.1 Small File Accumulation

The most common issue in streaming/CDC. Tens of thousands of 1–10MB files push plan time into minutes.

  • Symptom: EXPLAIN shows tens of thousands of files; plan time abnormally large relative to scan time
  • Fix: Periodic rewrite_data_files, tune write.target-file-size-bytes

2.2 Skipping Snapshot Expiration

Accumulated snapshots bloat metadata.json to gigabytes and block GC of data files.

  • Symptom: metadata.json is tens of MB or more; opening the table itself is slow
  • Fix: Daily expire_snapshots

2.3 Over-Frequent Commits

Sub-minute commits in streaming explode manifest counts.

  • Recommended: Flink commit interval ≥ 1 minute, preferably 5–10 minutes
  • Fix: Periodic rewrite_manifests

2.4 Delete File Accumulation (MoR)

Accumulated V2 delete files force an anti-join on every read.

  • Symptom: Read time grows linearly over time
  • Fix: rewrite_position_deletes; adopt V3 Deletion Vectors

2.5 Bad Partitioning

  • Partitioning by high-cardinality columns (e.g., user_id): list cost explodes; manifests bloat
  • Time partitions too fine (e.g., hourly with <10k rows per bucket): file-count explosion
  • Missing partition column: full scan

2.6 Missing or Inaccurate Statistics

Without column stats (min/max, null count) — or with wrong ones — manifest pruning is neutered.

  • Cause: blanket stats collection across hundreds of columns → some get skipped
  • Fix: write.metadata.metrics.default + per-column metrics.column.<col>

2.7 Hadoop Catalog in Production

File-lock-based; risk of data loss on concurrent commit conflicts. Never use outside dev/test.

2.8 Manifest Bloat

Scanning thousands of manifests per query inflates plan time.

  • Fix: commit.manifest-merge.enabled=true, periodic rewrite_manifests

2.9 No Sort Order

Without sorting on frequently filtered columns, manifest min/max ranges are too wide to enable pruning.

2.10 Vectorized Reader Disabled per Engine

Disabling the Spark/Trino vectorized Parquet reader causes a 2–5× slowdown. It is ON by default but can get accidentally turned off in production.


3. Key Table Properties — Write

Set via TBLPROPERTIES. Focusing on the commonly used ones.

3.1 File Size and Format

PropertyDefaultDescription
write.format.defaultparquetData file format (parquet/orc/avro)
write.target-file-size-bytes536870912 (512MB)Target file size. Analytics: 256MB–1GB
write.parquet.row-group-size-bytes134217728 (128MB)Parquet row-group size; too large reduces skip benefit
write.parquet.page-size-bytes1048576 (1MB)Parquet page size
write.parquet.dict-size-bytes2097152 (2MB)Max dictionary encoding size
write.parquet.compression-codeczstdzstd/snappy/gzip/lz4 — analytics: zstd
write.parquet.compression-level(codec-specific)zstd 1–9, typically 3
write.parquet.bloom-filter-enabled.column.<col>falseFor high-cardinality equality filters
write.metadata.compression-codecnoneMetadata compression; gzip for large tables
write.object-storage.enabledfalseS3 key-prefix randomization to avoid throttling
ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '536870912',
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3',
  'write.metadata.compression-codec' = 'gzip',
  'write.object-storage.enabled' = 'true'
);

3.2 Distribution and Sorting

PropertyDefaultDescription
write.distribution-modehash (Spark v3+)none/hash/range. Trade-off between file count and skew
write.delete.distribution-mode(same as write)Delete file distribution
write.update.distribution-mode(same as write)Update distribution
write.spark.fanout.enabledfalsePer-partition fanout writer; enable when partitions are very many

hash groups data per partition, reducing small files but vulnerable to skew. range pairs with sorting for balanced distribution.

3.3 Row-Level Operation Modes

PropertyDefault (V2)Description
write.delete.modecopy-on-writecopy-on-write or merge-on-read
write.update.modecopy-on-writeSame
write.merge.modecopy-on-writeSame
write.delete.target-file-size-bytes67108864 (64MB)Delete file size
write.delete.parquet.compression-codeczstdDelete file compression

Use MoR for CDC/frequent changes, CoW for batch ETL. Mismatched modes lead to surprising costs.

3.4 Metadata Cleanup

PropertyDefaultDescription
write.metadata.delete-after-commit.enabledfalseAuto-delete old metadata.json after commit
write.metadata.previous-versions-max100Retained metadata.json versions

Production typically uses delete-after-commit=true with previous-versions-max=20.

3.5 Statistics Collection

PropertyDefaultDescription
write.metadata.metrics.defaulttruncate(16)Default stats mode for all columns
write.metadata.metrics.column.<col>Per-column override (none/counts/truncate(N)/full)
write.metadata.metrics.max-inferred-column-defaults100Max columns to collect default stats

For wide tables with hundreds of columns, stats bloat metadata. Use full for the 5–10 frequently filtered columns and counts or none for the rest.

ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.metadata.metrics.default' = 'counts',
  'write.metadata.metrics.column.order_id' = 'full',
  'write.metadata.metrics.column.ts' = 'full',
  'write.metadata.metrics.column.customer_id' = 'full'
);

4. Key Table Properties — Read

PropertyDefaultDescription
read.split.target-size134217728 (128MB)Bytes processed per task
read.split.metadata-target-size33554432 (32MB)Metadata table split size
read.split.planning-lookback10Split packing lookback
read.split.open-file-cost4MBEstimated file-open cost (small-file combine threshold)
read.parquet.vectorization.enabledtrueParquet vectorized reader
read.parquet.vectorization.batch-size5000Vectorized batch row count
ALTER TABLE db.orders SET TBLPROPERTIES (
  'read.split.target-size' = '268435456',
  'read.split.open-file-cost' = '8388608'
);

open-file-cost is critical. Raising it when many small files exist increases file-combine per task and improves plan efficiency.


5. Commit / Manifest

5.1 Commit Retry

PropertyDefaultDescription
commit.retry.num-retries4Retries on OCC conflicts
commit.retry.min-wait-ms100Min wait
commit.retry.max-wait-ms60000Max wait
commit.retry.total-timeout-ms1800000 (30m)Total retry timeout
commit.status-check.num-retries3Commit status check retries

When streaming + batch write the same table concurrently, raising num-retries to 8–10 is safer.

5.2 Manifest Merging

PropertyDefaultDescription
commit.manifest.target-size-bytes8388608 (8MB)Target manifest size
commit.manifest.min-count-to-merge100Merge once this many exist
commit.manifest-merge.enabledtrueAuto-merge manifests on commit

For tables with frequent streaming commits, lower min-count-to-merge to 50–80 to prevent manifest explosion proactively.

5.3 Snapshot Retention

PropertyDefaultDescription
history.expire.max-snapshot-age-ms432000000 (5d)Max snapshot retention
history.expire.min-snapshots-to-keep1Min retained snapshots
history.expire.max-ref-age-ms(no limit)Max branch/tag retention

Tune per table — e.g., 30 days for master tables, 7 days for time-series facts.


6. Compaction Action Parameters

Options passed to rewrite_data_files.

OptionDefaultDescription
target-file-size-bytes(table property)Target file size
min-input-files5Min files for a group to be processed
min-file-size-bytes(target × 0.75)Candidate if smaller
max-file-size-bytes(target × 1.8)Candidate if larger
max-file-group-size-bytes107374182400 (100GB)Max size per group
max-concurrent-file-group-rewrites5Concurrent group rewrites
partial-progress.enabledfalseAllow partial commits per group
partial-progress.max-commits10Max partial commits
rewrite-job-ordernonebytes-asc/bytes-desc/files-asc/files-desc
CALL demo.system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'ts ASC NULLS LAST, customer_id ASC',
  options => map(
    'target-file-size-bytes', '536870912',
    'max-concurrent-file-group-rewrites', '10',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '50',
    'rewrite-job-order', 'bytes-asc'
  )
);

For large-table compaction, always set partial-progress.enabled=true — otherwise one failure can blow away hours of work.

6.1 Z-Order Compaction

CALL demo.system.rewrite_data_files(
  table => 'db.orders',
  strategy => 'sort',
  sort_order => 'zorder(customer_id, ts)',
  options => map('target-file-size-bytes', '536870912')
);

Z-Order shines when two or more columns are combined in filters.

6.2 Delete File Compaction

CALL demo.system.rewrite_position_deletes(
  table => 'db.orders',
  options => map(
    'rewrite-job-order', 'bytes-asc',
    'partial-progress.enabled', 'true'
  )
);

For MoR production tables, delete compaction can matter more than data compaction.

6.3 Manifest Rewrite

CALL demo.system.rewrite_manifests(
  table => 'db.orders',
  use_caching => true
);

If plan time has grown unusually long, this is the first thing to check.


7. Engine Settings — Spark

7.1 SparkSession Level

# Iceberg vectorized reader
spark.sql.iceberg.vectorization.enabled = true
 
# Timestamp handling (timezone-aware)
spark.sql.iceberg.handle-timestamp-without-timezone = false
 
# Catalog cache
spark.sql.catalog.demo.cache-enabled       = true
spark.sql.catalog.demo.cache.expiration-interval-ms = 30000
 
# Adaptive Query Execution
spark.sql.adaptive.enabled                       = true
spark.sql.adaptive.coalescePartitions.enabled    = true
spark.sql.adaptive.skewJoin.enabled              = true
 
# Shuffle partitions (pairs with write distribution=hash)
spark.sql.shuffle.partitions = 800

7.2 Write Options (DataFrame writer)

(df.writeTo("demo.db.orders")
    .option("write-format", "parquet")
    .option("target-file-size-bytes", "536870912")
    .option("check-ordering", "false")     # skip ordering check
    .option("fanout-enabled", "true")      # when many partitions
    .append())

7.3 Read Options

(spark.read.format("iceberg")
    .option("split-size", "268435456")
    .option("lookback", "20")
    .option("file-open-cost", "8388608")
    .option("vectorization-enabled", "true")
    .load("demo.db.orders"))

8. Engine Settings — Trino

Set in iceberg.properties or per catalog.

connector.name = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181
 
# Performance
iceberg.split-size                          = 128MB
iceberg.max-splits-per-second               = 100
iceberg.use-file-size-from-metadata         = true
 
# Manifest cache (critical)
iceberg.metadata-cache.enabled              = true
iceberg.metadata-cache.max-size             = 1000
iceberg.metadata-cache.ttl                  = 1h
iceberg.io.manifest.cache-enabled           = true
iceberg.io.manifest.cache.max-total-bytes   = 104857600
iceberg.io.manifest.cache.expiration-interval-ms = 60000
 
# Parquet
parquet.use-column-index                    = true
parquet.use-bloom-filter                    = true
parquet.max-read-block-size                 = 16MB
 
# Stats-driven CBO
iceberg.table-statistics-enabled            = true
iceberg.extended-statistics.enabled         = true

The single biggest win on Trino is iceberg.io.manifest.cache-enabled — plan time on large tables can drop by more than half.


CREATE TABLE orders_iceberg (
  id BIGINT,
  ts TIMESTAMP(3),
  amount DECIMAL(10,2)
) WITH (
  'connector' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://iceberg-rest:8181',
  'warehouse' = 's3://my-warehouse',
 
  -- Commit frequency (the most important knob)
  'write-parallelism' = '8',
  'upsert-enabled' = 'true',
  'equality-field-columns' = 'id',
 
  -- File size
  'write.target-file-size-bytes' = '134217728',   -- 128MB (smaller for streaming)
 
  -- Checkpoint-based commit
  'write.distribution-mode' = 'hash'
);

In Flink, commits happen per checkpoint, so the checkpoint interval is effectively the commit interval.

# flink-conf.yaml
execution.checkpointing.interval = 5min
execution.checkpointing.min-pause = 1min

Too frequent checkpoints explode manifests and snapshots; too rare hurt data freshness. 5–10 minutes is typical.


10.1 Analytics-Heavy (Read-heavy, batch ingest)

ALTER TABLE db.orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '1073741824',   -- 1GB
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '6',
  'write.parquet.row-group-size-bytes' = '268435456',
  'write.delete.mode' = 'copy-on-write',
  'write.distribution-mode' = 'range',
  'read.split.target-size' = '536870912'
);

10.2 Streaming Ingest + Batch Analytics

ALTER TABLE db.events SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',    -- 128MB
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3',
  'write.delete.mode' = 'merge-on-read',
  'write.distribution-mode' = 'hash',
  'commit.manifest.min-count-to-merge' = '50',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '20'
);

10.3 CDC (Frequent Upserts)

ALTER TABLE db.cdc_users SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '268435456',
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read',
  'write.delete.target-file-size-bytes' = '67108864',
  'format-version' = '2',
  'commit.manifest-merge.enabled' = 'true'
);

If V3 is available, Deletion Vectors push read performance further.


11. Monitoring — What to Watch

Track these daily to catch issues early.

MetricThreshold (example)Action
Files per table> 100kCompact
Average file size< 64MBCompact (target-file-size-bytes)
Manifest count> 1000rewrite_manifests
Snapshot count> 1000expire_snapshots
metadata.json size> 50MBCompress metadata + expire snapshots
Delete file ratio> 10% of datarewrite_position_deletes
Plan time3× normalInspect manifest/cache
Commit retriesAvg > 1Audit concurrent writers

You can query Iceberg's metadata tables directly.

-- File count, average size
SELECT count(*), avg(file_size_in_bytes)
FROM iceberg.db."orders$files";
 
-- Snapshot count
SELECT count(*) FROM iceberg.db."orders$snapshots";
 
-- Per-partition file stats
SELECT partition, file_count, total_size, position_delete_record_count
FROM iceberg.db."orders$partitions"
ORDER BY file_count DESC LIMIT 20;

12. Checklist

Run through this whenever you create a new table.

  • Is write.target-file-size-bytes aligned with the workload?
  • Is write.distribution-mode what you intended (none/hash/range)?
  • Do write.delete.mode / write.update.mode / write.merge.mode match each other?
  • Are write.metadata.delete-after-commit.enabled = true and previous-versions-max set?
  • Stats (full) configured for frequently filtered columns?
  • write.object-storage.enabled for large tables to avoid S3 throttling?
  • Sort Order or Z-Order defined?
  • Compaction, snapshot expire, orphan cleanup scheduled?
  • Monitoring dashboard registered?
  • (Trino) iceberg.io.manifest.cache-enabled = true?
  • (Flink) Checkpoint interval ≥ 5 minutes?

13. Wrap-Up

The essence of Iceberg performance tuning is keeping three things in balance: file count, metadata size, and statistics quality. Hardly any problem yields to a single knob — you have to design write/read/commit/compaction together to fit the workload.

The most common problem is always the same — streaming commits too often, and compaction/expire jobs aren't running. Before reaching for new parameters, audit your operational automation first.

Related posts: