Apache Iceberg Whitepaper — Structure and Adoption Strategy for the Next-Generation Lakehouse Table Format
A whitepaper for data architects and platform leaders covering Apache Iceberg's metadata structure, operating model, multi-engine compatibility, and adoption strategy. Spans comparisons with Hive, Delta Lake, and Hudi; migration patterns; operational automation; and the 2026 outlook.
Scope of this whitepaper
This paper covers Apache Iceberg from two angles: "why adopt it" and "how to operate it." Aimed at data architects, platform leads, and CDOs choosing and adopting table formats in multi-engine Lakehouse environments, it goes beyond the spec to cover operational automation, catalog topology, and migration patterns from a practitioner perspective.
1. Executive Summary
1.1 One-line summary
Apache Iceberg is an open table format that brings database-level transactions and evolution semantics on top of object storage, and as of 2026 it has become the de facto standard for multi-engine Lakehouses.
1.2 Key conclusions
- As the Lakehouse has moved from "an ACID layer for a single engine" to "shared tables for many engines," the value of Iceberg's engine-neutral specification and the REST Catalog standard has come into sharp focus.
- Iceberg is not a "storage format" but a "metadata structure." It is a specification that adds snapshot-based isolation, hidden partitioning, schema/partition evolution, time travel, and branches/tags on top of file formats like Parquet, ORC, and Avro.
- In a single-engine environment (e.g., Databricks-only), Delta Lake is still a reasonable choice, but for workloads with two or more engines in play, or where long-term retention, legal correction, and experiment isolation matter, Iceberg has a clear advantage.
- Operating costs are not negligible. Without automating compaction, snapshot expiration, orphan file cleanup, and catalog operations, Iceberg quickly turns into a "metadata swamp." This paper presents the standard operational-automation patterns alongside the spec.
- With Iceberg V3 and the REST Catalog standard consolidating, the catalog layer will become the true control plane of the Lakehouse over the next three years.
1.3 Who benefits, and from which decisions
| Reader | What this paper provides |
|---|---|
| Data platform architects | Metadata layering, catalog topology, engine compatibility matrix |
| Data engineering leads | Operational automation patterns, compaction and cleanup standards, monitoring metrics |
| CDO / data executives | Adoption decision tree, migration risk and timeline estimates, cost structure |
| ML / analytics leads | Experimentation and reproducibility patterns using time travel and branches/tags |
2. Background: Why Yet Another Table Format
2.1 Structural limits of the Hive era
Through the mid-2010s, the de facto standard across the Hadoop and Spark ecosystem was Hive Metastore (HMS) + directory-based partitioning. It was simple, but its structural limits accumulated as scale grew.
- Directory = partition = query predicate. Partition pruning only worked when a predicate like WHERE event_date = '2026-05-20' matched the directory path (/event_date=2026-05-20/) exactly. Once a column was renamed or a user wrote an expression like WHERE ts > ..., pruning collapsed.
- No partition evolution. A decision like "switch from daily to hourly partitions" effectively meant rewriting the table. For a petabyte-scale table in production, that was nearly impossible.
- Schema evolution by external convention. Hive supported column add/rename to some degree, but position-based mapping without column IDs made column reordering and rename risky.
- No atomicity. Directory rename was used to fake "atomic publish," but on object stores like S3, rename is implemented as non-atomic copying — yielding partial visibility and duplication.
- No read-write concurrency. Reading the same partition while another job wrote to it would expose partial files or inconsistent ListBucket results.
2.2 The cloud + object storage shift was the breaking point
S3, ADLS, and GCS differed from HDFS in two decisive ways.
- List is expensive and weakly consistent. Hive's structure made tens of thousands of S3 list calls to identify files, and S3 list takes minutes once a prefix carries hundreds of thousands of keys.
- Rename is effectively copy + delete. HDFS directory rename was a metadata operation; in S3 it is per-object copying. A job failure mid-run leaves a "half-written" state visible.
In short, Hive was designed on the premise that "the filesystem provides strong consistency and fast renames," and the cloud broke that premise.
2.3 New formats arrive — Iceberg, Delta Lake, Hudi
Three formats arrived nearly simultaneously in this era.
| Format | Started by | Design starting point |
|---|---|---|
| Apache Iceberg | 2017, Netflix | "You must be able to know which files belong to a table without listing directories." |
| Delta Lake | 2017, Databricks | "Extract Spark's transaction log so we can layer ACID on top." |
| Apache Hudi | 2016, Uber | "Incremental processing optimized for streaming/upsert workloads." |
All three solve the problem with a metadata layer, but the structure of that metadata and the operating model differ — which in turn drives the differences in which workloads each format handles best.
2.4 How Iceberg's reception evolved
- 2018–2020: "Netflix's internal project." Interesting use cases, limited adoption.
- 2020–2022: Promoted to Apache Top-Level (2020). AWS Athena/EMR, Snowflake, Trino, and Flink added support in turn.
- 2023–2024: The REST Catalog spec v1 was agreed; Snowflake Polaris and Databricks Unity began handling Iceberg directly, cementing the role of "a shared format across engines."
- 2025–2026: Iceberg V3 (Variant type, Geospatial, default values, deletion vectors) lands, the spec grows richer, and the REST Catalog becomes the de facto catalog standard.
3. Iceberg Architecture in Depth
3.1 The three-layer model
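To guarantee the same answer regardless of which engine reads the table, Iceberg separates table state into three layers: the catalog, which holds the pointer to the current metadata.json; the metadata layer (metadata.json, manifest lists, and manifests); and the data layer (the data files themselves).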
The consequences of this separation are clear:
- Reading = no directory listing. You always walk the metadata.json → manifest list → manifest → data file tree. You are free from list cost and consistency problems.
- Writing = create a new metadata.json and atomically swap the pointer in the catalog. Old data files and the old metadata.json remain in place, so other readers don't break.
- The data files themselves carry no notion of "what shape the table should be." Column names, types, partitions — all of it lives in the metadata layer. That's why schema and partition evolution happen without data rewrite.
3.2 What's actually inside metadata.json
Roughly the following structure (V2-based, simplified).
{
"format-version": 2,
"table-uuid": "5f8a...e9",
"location": "s3://bucket/warehouse/db/events",
"last-updated-ms": 1747804800000,
"last-column-id": 12,
"schemas": [
{
"schema-id": 0,
"fields": [
{ "id": 1, "name": "event_id", "required": true, "type": "long" },
{ "id": 2, "name": "user_id", "required": false, "type": "long" },
{ "id": 3, "name": "event_ts", "required": true, "type": "timestamptz" },
{ "id": 4, "name": "event_type", "required": true, "type": "string" }
]
}
],
"current-schema-id": 0,
"partition-specs": [
{
"spec-id": 0,
"fields": [
{ "name": "event_day", "source-id": 3, "transform": "day", "field-id": 1000 }
]
}
],
"default-spec-id": 0,
"sort-orders": [ ... ],
"current-snapshot-id": 8123412345678901234,
"snapshots": [
{
"snapshot-id": 8123412345678901234,
"timestamp-ms": 1747804790000,
"summary": {
"operation": "append",
"added-data-files": "3",
"added-records": "1450000"
},
"manifest-list": "s3://.../snap-8123-1-abc.avro",
"schema-id": 0
}
],
"refs": {
"main": { "snapshot-id": 8123..., "type": "branch" },
"wap-2026-05-20": { "snapshot-id": 8123..., "type": "tag" }
}
}
Key observations:
- The id inside fields is the real identifier of a column. The name can change while the ID stays the same.
- partition-specs is an array. That is, a table can carry different partition specs over its lifetime.
- snapshots are cumulative. All of them are kept (until expired) for time travel.
- refs are Git-like branch/tag references. Beyond main, user-defined branches and tags are treated as first-class.
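The observations above can be made concrete with a short Python sketch. It parses a trimmed-down metadata document with the same field names as the example and resolves the current schema and the snapshot behind a ref. The helper names (current_schema, snapshot_for_ref) and the manifest-list path are hypothetical and exist only for illustration; real engines use their Iceberg library rather than parsing the file by hand.

```python
import json

# Trimmed-down metadata document mirroring the fields shown above.
# Values (including the manifest-list path) are illustrative only.
METADATA_JSON = """
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {"schema-id": 0, "fields": [
      {"id": 1, "name": "event_id", "required": true, "type": "long"},
      {"id": 3, "name": "event_ts", "required": true, "type": "timestamptz"}
    ]}
  ],
  "current-snapshot-id": 8123412345678901234,
  "snapshots": [
    {"snapshot-id": 8123412345678901234,
     "manifest-list": "s3://bucket/warehouse/db/events/metadata/snap-example.avro",
     "schema-id": 0}
  ],
  "refs": {
    "main": {"snapshot-id": 8123412345678901234, "type": "branch"}
  }
}
"""


def current_schema(meta: dict) -> dict:
    """Return the schema whose schema-id matches current-schema-id."""
    wanted = meta["current-schema-id"]
    return next(s for s in meta["schemas"] if s["schema-id"] == wanted)


def snapshot_for_ref(meta: dict, ref_name: str = "main") -> dict:
    """Resolve a branch or tag name to the snapshot it points at."""
    snapshot_id = meta["refs"][ref_name]["snapshot-id"]
    return next(s for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)


meta = json.loads(METADATA_JSON)
# Columns are keyed by ID, so a rename only changes the value, never the key.
columns_by_id = {f["id"]: f["name"] for f in current_schema(meta)["fields"]}
main_snapshot = snapshot_for_ref(meta, "main")
print(columns_by_id)
print(main_snapshot["manifest-list"])
```

Everything a reader needs to start planning a scan (schema, snapshot, manifest list location) is reachable from this one document, which is why no directory listing is required.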
3.3 The division of labor between manifest list and manifest
To prune files quickly at query time, data file information is summarized in two stages.
Query pruning flow (WHERE event_day = '2026-05-20' AND user_id = 42)
- Get the current metadata.json location from the catalog.
- metadata.json → read the manifest list of the current snapshot.
- For each row of the manifest list (= one manifest file), first-pass pruning using the summarized partition ranges. Skip manifests that don't cover event_day = '2026-05-20'.
- Read only the surviving manifests. Use each data file's user_id lower/upper bound for second-pass pruning.
- Open only the remaining data files.
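The two-pass flow above can be sketched in a few lines of Python. This is a simplified model, not Iceberg's planner: the Manifest and DataFile classes, their field names, and the file names are assumptions made for the example, and partition ranges are represented as ISO date strings so they compare correctly as text.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    user_id_min: int  # lower bound of user_id recorded in the manifest
    user_id_max: int  # upper bound of user_id recorded in the manifest


@dataclass
class Manifest:
    path: str
    day_min: str  # smallest event_day covered (ISO date string)
    day_max: str  # largest event_day covered (ISO date string)
    files: list = field(default_factory=list)


def plan_files(manifest_list, event_day, user_id):
    """Return paths of data files that may contain matching rows."""
    selected = []
    for manifest in manifest_list:
        # First pass: skip whole manifests using the partition summary.
        if not (manifest.day_min <= event_day <= manifest.day_max):
            continue
        # Second pass: skip files whose user_id bounds exclude the value.
        for data_file in manifest.files:
            if data_file.user_id_min <= user_id <= data_file.user_id_max:
                selected.append(data_file.path)
    return selected


manifest_list = [
    Manifest("m1.avro", "2026-05-19", "2026-05-19",
             [DataFile("a.parquet", 1, 100)]),
    Manifest("m2.avro", "2026-05-20", "2026-05-20",
             [DataFile("b.parquet", 1, 40), DataFile("c.parquet", 41, 90)]),
]
print(plan_files(manifest_list, "2026-05-20", 42))
```

Only c.parquet survives both passes: m1.avro is skipped without being opened, and b.parquet is skipped from its bounds alone.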
What this gives you:
- Even tables with hundreds of thousands of data files only need to examine tens to hundreds per query.
- No directory listing required, so you are not dependent on S3 list cost or consistency.
- Statistics (lower/upper/null) are condensed into the manifests, so you don't need to open every Parquet footer.
3.4 Hidden partitioning
One of the biggest pain points of the Hive era was "the partition expression and the query predicate must match exactly." Iceberg solves this by registering a transform on the metadata.
-- DDL: partition event_ts by day (data only stores event_ts)
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_ts TIMESTAMP,
event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_ts));
-- Query: you don't need to know the partition column
SELECT count(*)
FROM events
WHERE event_ts >= TIMESTAMP '2026-05-20 00:00:00'
AND event_ts < TIMESTAMP '2026-05-21 00:00:00';
- Only event_ts is stored in the data; the partition key event_day only exists in the metadata.
- Even when the predicate is on event_ts, the engine understands the partition spec's transform (days(event_ts)) and performs partition pruning automatically.
- The user does not have to write a condition like event_day = .... The "partition key" is hidden from the user.
Supported transforms: identity, bucket(N, col), truncate(W, col), year, month, day, hour, void.
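To illustrate how a predicate on event_ts turns into partition pruning, here is a minimal Python sketch of the day transform, which maps a timestamp to the number of days since 1970-01-01. It assumes timestamps are in UTC; the function names are hypothetical, and a real engine derives this from the partition spec instead of hard-coding it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Days since 1970-01-01 (UTC) for a timezone-aware timestamp."""
    return (ts - EPOCH).days


def partitions_for_range(start: datetime, end_exclusive: datetime) -> range:
    """Partition values that can hold rows with start <= event_ts < end."""
    first = day_transform(start)
    # Step back one microsecond so an exclusive midnight bound does not
    # pull in the following day's partition.
    last = day_transform(end_exclusive - timedelta(microseconds=1))
    return range(first, last + 1)


# The query range from the example above, assumed to be UTC.
start = datetime(2026, 5, 20, tzinfo=timezone.utc)
end = datetime(2026, 5, 21, tzinfo=timezone.utc)
matching_days = list(partitions_for_range(start, end))
print(matching_days)
```

The query never mentions event_day, yet the range on event_ts resolves to a single partition value, which is all the planner needs to skip every other day.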
3.5 Schema evolution
Because Iceberg assigns columns permanent IDs, the following changes happen without rewriting data:
| Operation | Safety | Notes |
|---|---|---|
| Add column | Safe | Existing rows are NULL or default |
| Drop column | Safe | The ID disappears from metadata only; data files are untouched |
| Rename column | Safe | ID is preserved; metadata.json updates only the name |
| Reorder column | Safe | Only the field order in metadata changes |
| Type widening | Partially safe | int → long, float → double, decimal precision increases, etc. |
| Type narrowing | Not allowed | Lossy conversions like long → int are forbidden |
| nullable → required | Conditional | Must verify that every existing row is non-null |
What makes this possible is that Iceberg resolves columns by field ID rather than by name or position. Field IDs are written into each data file's schema, so even if a name changes in metadata, the reader still finds the right column.
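A small Python sketch shows why ID-based resolution makes rename and add safe. The dictionaries below stand in for a data file's embedded schema and the table's current schema; the renamed column (kind) and the added column (country) are invented for the example.

```python
# Schema recorded in an old data file: field ID -> column name at write time.
file_schema = {1: "event_id", 2: "user_id", 4: "event_type"}

# One row from that file, keyed by the names the file was written with.
file_row = {"event_id": 10, "user_id": 42, "event_type": "click"}

# Current table schema after renaming event_type -> kind (ID 4 unchanged)
# and adding a new column "country" with a new ID 5.
table_schema = {1: "event_id", 2: "user_id", 4: "kind", 5: "country"}


def project_row(row, file_schema, table_schema):
    """Map a file row to current column names by joining on field ID."""
    out = {}
    for field_id, current_name in table_schema.items():
        old_name = file_schema.get(field_id)
        # Columns that did not exist when the file was written read as None.
        out[current_name] = row[old_name] if old_name is not None else None
    return out


projected = project_row(file_row, file_schema, table_schema)
print(projected)
```

The old file is never rewritten: the rename is a metadata-only change, and the added column simply reads as null for rows written before it existed.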
3.6 Partition evolution
Iceberg allows partition specs to be added over time. Patterns like "start daily, switch to hourly once traffic grows" are operationally feasible.
-- Initially: daily partitions
ALTER TABLE events SET TBLPROPERTIES (...);
-- partition spec id = 0 : (day(event_ts))
-- Switch to hourly mid-flight
ALTER TABLE events
REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);
-- partition spec id = 1 : (hour(event_ts))
Operational notes:
- Historical data remains under the old partition spec, and only new writes use the new spec.
- Two partition specs therefore coexist within one table. Queries work correctly against both, but pruning efficiency differs by spec.
- If needed, rewrite_data_files can rewrite old data to the new spec (with data-movement cost).
3.7 V1 vs V2 vs V3 — spec evolution
| Item | V1 | V2 | V3 (2025+) |
|---|---|---|---|
| Standardized | 2018+ | 2021+ | 2025+ |
| Row-level delete | Impossible (CoW only) | Position / Equality delete file | Deletion vectors (Puffin) |
| Sequence number | None | Yes (essential for consistency) | Retained |
| Column default value | No | No | Yes |
| Variant type | No | No | Yes (semi-structured) |
| Geospatial type | No | No | Yes |
| Row Lineage | No | No | Yes (CDC-friendly) |
What V2 brought — row-level deletes:
- In V1, deleting one row meant rewriting the whole file it lived in (Copy-on-Write, CoW).
- V2 introduced two kinds of delete file that allow expressing the change as "old file + delete marker" (Merge-on-Read, MoR).
  - Position delete: (file path, row position). Best suited to CDC and MERGE workloads.
  - Equality delete: (column = value). Best for key-based deletes.
- At read time the engine applies the deletes in memory. Without frequent compaction, reads slow down.
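The read-time merge can be sketched as follows. This is a deliberately simplified Python model: it ignores sequence numbers, which real readers use to decide which delete files apply to which data files, and the function and parameter names are assumptions for the example.

```python
def read_with_deletes(file_path, rows, position_deletes, equality_deletes):
    """Yield rows of one data file that survive the delete files.

    position_deletes: set of (file_path, row_position) pairs.
    equality_deletes: list of {column: value} dicts; a row is deleted when
    it matches every column of any one dict.
    """
    for position, row in enumerate(rows):
        # Position delete: this exact row of this exact file is gone.
        if (file_path, position) in position_deletes:
            continue
        # Equality delete: any row carrying these column values is gone.
        if any(all(row.get(col) == val for col, val in d.items())
               for d in equality_deletes):
            continue
        yield row


rows = [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}]
position_deletes = {("a.parquet", 0)}
equality_deletes = [{"user_id": 3}]
surviving = list(read_with_deletes("a.parquet", rows,
                                   position_deletes, equality_deletes))
print(surviving)
```

Every read pays for these checks, and the work grows with the number of delete files, which is why MoR tables depend on regular compaction.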
What V3 brings:
- Deletion Vectors (Puffin format) — A more efficient representation of V2 position deletes. A Roaring bitmap reduces memory and disk usage.
- Variant type — Stores semi-structured data like JSON with a consistent encoding; engines interpret it the same way.
- Row Lineage — Assigns a stable ID per row, useful for CDC and ML reproducibility.
3.8 Copy-on-Write vs Merge-on-Read
After V2 the most important operational decision is whether to use CoW or MoR per table.
| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
|---|---|---|
| UPDATE/DELETE behavior | Rewrite affected files | Keep old files + add delete files |
| Write cost | High (large file rewrite) | Low (only small delete files) |
| Read cost | Low (no delete to apply) | High (deletes applied in memory) |
| Compaction/sort state | Maintained immediately | Degrades over time, needs compaction |
| Best for | Analytics-heavy, infrequent corrections | CDC, GDPR corrections, frequent upserts |
Operational recommendations:
- Analytics-heavy tables (e.g., daily aggregates, star-schema facts) — CoW recommended.
- CDC/MERGE patterns (e.g., real-time user-state tables) — MoR + periodic compaction recommended.
- Set the mode explicitly with table properties:
ALTER TABLE events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
3.9 Time travel, branches, tags
Iceberg's snapshot model naturally extends to Git-like data version control.
-- Time travel (specific timestamp)
SELECT * FROM events FOR SYSTEM_TIME AS OF '2026-05-20 09:00:00';
-- By snapshot ID directly
SELECT * FROM events VERSION AS OF 8123412345678901234;
-- Create a branch (Write-Audit-Publish pattern)
ALTER TABLE events CREATE BRANCH `wap-2026-05-20`;
-- Write changes onto the branch and verify
INSERT INTO events.`branch_wap-2026-05-20` SELECT ...;
-- On success, fast-forward main
CALL system.fast_forward('db.events', 'main', 'wap-2026-05-20');
-- Tag (permanent retention point)
ALTER TABLE events CREATE TAG `q1-2026-close`
AS OF VERSION 8123412345678901234
RETAIN 365 DAYS;
Patterns:
- Write-Audit-Publish (WAP): Write new data to a branch first, and merge to main only after quality checks (dbt test, Great Expectations) pass. If validation fails, simply discard the branch, so bad data is never exposed on main, even briefly.
- ML experiment isolation: Pin training snapshots with tags (e.g., model-v3-train). Six months later you can retrain on identical data.
- Legal correction and audit: Tag the pre-correction state so auditors can see "before and after" the correction.
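The mechanics behind these patterns are small: a branch or tag is just a name pointing at a snapshot ID. The toy Python model below shows the WAP sequence (branch, commit, fast-forward) and tagging. The TableRefs class is hypothetical and only tracks pointers; it does not model data files, retention, or concurrent writers.

```python
class TableRefs:
    """Toy model of Iceberg refs: names pointing at snapshot IDs."""

    def __init__(self, main_snapshot_id):
        self.parents = {main_snapshot_id: None}  # snapshot -> parent snapshot
        self.branches = {"main": main_snapshot_id}
        self.tags = {}

    def create_branch(self, name, from_branch="main"):
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, new_snapshot_id):
        """Append a snapshot on top of a branch head."""
        self.parents[new_snapshot_id] = self.branches[branch]
        self.branches[branch] = new_snapshot_id

    def is_ancestor(self, ancestor, snapshot_id):
        """True if ancestor is reachable by walking parents from snapshot_id."""
        while snapshot_id is not None:
            if snapshot_id == ancestor:
                return True
            snapshot_id = self.parents[snapshot_id]
        return False

    def fast_forward(self, branch, to_branch):
        """Move branch to to_branch's head, only if history is linear."""
        if not self.is_ancestor(self.branches[branch], self.branches[to_branch]):
            raise ValueError("cannot fast-forward: branches have diverged")
        self.branches[branch] = self.branches[to_branch]

    def create_tag(self, name, snapshot_id):
        self.tags[name] = snapshot_id


refs = TableRefs(main_snapshot_id=100)
refs.create_branch("wap-2026-05-20")
refs.commit("wap-2026-05-20", 101)      # write: readers of main still see 100
main_during_audit = refs.branches["main"]
refs.fast_forward("main", "wap-2026-05-20")  # publish after checks pass
refs.create_tag("q1-2026-close", 101)
print(main_during_audit, refs.branches["main"], refs.tags)
```

Publishing is a pointer move, not a data copy, which is why discarding a failed audit branch costs almost nothing.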
4. Iceberg from an Operations Perspective
4.1 The standard maintenance job set
Operating Iceberg means automating these four maintenance jobs.
| Job | What it does | Recommended frequency |
|---|---|---|
| rewrite_data_files | Merges small files into large ones; applies a sort order | Daily–weekly |
| rewrite_manifests | Reorganizes manifests to restore pruning efficiency | Weekly–monthly |
| expire_snapshots | Removes old snapshots and the files referenced only by those snapshots | Daily |
| remove_orphan_files | Removes data/metadata files not referenced by any metadata | Weekly–monthly |
What happens if you skip them:
- Small file explosion — A streaming job committing every 10 seconds creates 8,640 files/day. After a month that's 260,000. Each query must read thousands of manifests.
- Metadata explosion — Tens of thousands of accumulated snapshots make metadata.json grow into tens of MBs, and every write rewrites it in full — commits slow down.
- Storage cost explosion — Without expiration, a table that updated one row a hundred million times ends up storing tens of times the source volume.
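The small-file arithmetic above is easy to reproduce. The Python sketch below assumes one data file per commit, which is the best case; a writer that produces one file per partition or per task on each commit multiplies the result accordingly. The function name and parameters are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_created(commit_interval_s, days, files_per_commit=1):
    """Data files produced by a job that commits on a fixed interval."""
    commits_per_day = SECONDS_PER_DAY // commit_interval_s
    return commits_per_day * files_per_commit * days


per_day = files_created(commit_interval_s=10, days=1)
per_month = files_created(commit_interval_s=10, days=30)
print(per_day, per_month)
```

With a 10-second interval this gives 8,640 files per day and 259,200 over 30 days, which is the roughly 260,000 quoted above.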
4.2 Compaction design
-- Spark SQL: basic compaction
CALL system.rewrite_data_files(
table => 'db.events',
options => map(
'min-input-files', '5',
'target-file-size-bytes','536870912', -- 512 MiB
'rewrite-all', 'false'
)
);
-- Add a sort order
CALL system.rewrite_data_files(
table => 'db.events',
strategy => 'sort',
sort_order => 'event_ts ASC, user_id ASC'
);
Design principles:
- target-file-size between 256–1024 MiB. Too small and you pay list/open cost; too large and shuffle/memory pressure grows.
- Sort by the columns most often used for pruning — maximizes lower/upper bound efficiency.
- MoR tables also apply delete files in the same pass during compaction, so periodic compaction is the key to maintaining query performance.
- Job-size control: Don't rewrite every file at once. Compact incrementally by partition or time range. Use the where option to scope.
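One way to keep compaction incremental is to pick only the partitions that actually need it before launching any rewrite. The Python sketch below takes (partition, file size) pairs, such as the output of a query against the files metadata table, and returns the partitions worth compacting. The 32 MiB small-file cutoff and the minimum of 5 input files are assumptions borrowed from values used elsewhere in this paper; tune both per table.

```python
MIB = 1024 * 1024


def partitions_to_compact(files, small_file_bytes=32 * MIB, min_input_files=5):
    """Return partitions holding at least min_input_files small files.

    files: iterable of (partition, file_size_in_bytes) pairs.
    """
    small_counts = {}
    for partition, size in files:
        if size < small_file_bytes:
            small_counts[partition] = small_counts.get(partition, 0) + 1
    return sorted(p for p, n in small_counts.items() if n >= min_input_files)


files = (
    [("2026-05-19", 8 * MIB)] * 6       # six small files: worth compacting
    + [("2026-05-20", 8 * MIB)] * 2     # only two small files: leave alone
    + [("2026-05-20", 512 * MIB)]       # already at target size
)
candidates = partitions_to_compact(files)
print(candidates)
```

Each returned partition can then be compacted in its own scoped job, which bounds shuffle size and makes a failed run cheap to retry.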
4.3 Snapshot expiration and cleanup
-- Expire snapshots older than 7 days, keep at least 5
CALL system.expire_snapshots(
table => 'db.events',
older_than => TIMESTAMP '2026-05-13 00:00:00',
retain_last => 5
);
-- Remove orphan files not referenced by any snapshot (older than 3 days)
CALL system.remove_orphan_files(
table => 'db.events',
older_than => TIMESTAMP '2026-05-17 00:00:00'
);
Operational recommendations:
- Be conservative with older_than: Long-running jobs (e.g., 6-hour backfills) might still reference old snapshots.
- remove_orphan_files requires care: A wrong invocation can delete files another job just wrote. Validate with dry_run => true first.
- Define a time-travel SLA: A policy like "we will not restore data older than 30 days" makes expiration thresholds clear.
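The first and third recommendations combine into a simple rule for deriving older_than: keep whichever window is longer, the time-travel SLA or the longest-running job that may still be reading an old snapshot. A minimal Python sketch, with hypothetical function and parameter names:

```python
from datetime import datetime, timedelta, timezone


def expiration_cutoff(now, time_travel_sla, longest_job):
    """Timestamp before which snapshots may be expired."""
    # Retain the longer of the two windows so neither guarantee is broken.
    retention = max(time_travel_sla, longest_job)
    return now - retention


now = datetime(2026, 5, 20, tzinfo=timezone.utc)
cutoff = expiration_cutoff(
    now,
    time_travel_sla=timedelta(days=7),
    longest_job=timedelta(hours=6),
)
print(cutoff.isoformat())
```

With a 7-day SLA and a 6-hour backfill, the cutoff lands on 2026-05-13, matching the older_than value in the example above. Computing it in the scheduler, rather than hard-coding a timestamp, keeps the job safe to rerun on any day.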
4.4 Catalog selection
Iceberg abstracts catalogs, but the catalog you actually pick drives the operating model.
| Catalog | Best for | Limitations |
|---|---|---|
| Hive Metastore (HMS) | Gradual adoption on top of existing Hive assets | Weak consistency and permission model in multi-engine settings |
| AWS Glue | Single-cloud AWS with Athena/EMR/Redshift integration | Awkward to use outside AWS |
| REST Catalog | Multi-engine, multi-cloud on a standardized spec | Self-hosting / operations burden; need to pick a backend implementation |
| Project Nessie | Git-like data versioning (branch/merge) | Limited permissioning and SaaS options |
| Snowflake Polaris | Multi-engine sharing in Snowflake-centric environments | Some Snowflake coupling remains |
| Databricks Unity Catalog | Databricks-centric environments where UC handles Iceberg as first-class | Engines outside UC need separate REST adapters |
Recommended pattern: For a new multi-engine environment, put a backend that implements the REST Catalog spec (Apache Polaris, Tabular OSS, Lakekeeper, Apache Gravitino, Unity Catalog OSS) in front, and let Spark, Trino, Flink, BigQuery, and Snowflake all access tables through it.
4.5 Write modes and distribution tuning
ALTER TABLE events SET TBLPROPERTIES (
'write.distribution-mode' = 'hash', -- 'none' | 'hash' | 'range'
'write.target-file-size-bytes' = '536870912', -- 512 MiB
'write.parquet.compression-codec' = 'zstd',
'write.parquet.row-group-size-bytes' = '134217728',
'commit.retry.num-retries' = '8',
'commit.retry.min-wait-ms' = '500'
);
Key parameters:
- write.distribution-mode
  - none: Input distribution preserved. Lots of small files appear.
  - hash: Hash on the partition columns. Generally recommended. File count per partition becomes uniform.
  - range: Sorted distribution. Useful for time-ordered log workloads.
- commit.retry.*: Retry policy on optimistic concurrency control failures with many writers. Raise the values when conflicts are frequent.
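To show what the retry settings govern, here is a minimal Python sketch of an optimistic commit: the writer prepares new metadata against the pointer it read, then asks the catalog to swap the pointer only if nobody else has moved it. The InMemoryCatalog class, the doubling backoff, and the function names are assumptions for illustration; the defaults simply reuse the values from the example above, and real clients implement this inside the Iceberg library.

```python
import time


class InMemoryCatalog:
    """Stand-in for a catalog that stores one metadata pointer per table."""

    def __init__(self, location):
        self.location = location

    def compare_and_swap(self, expected, new):
        """Atomically replace the pointer if it still equals expected."""
        if self.location != expected:
            return False
        self.location = new
        return True


def commit_with_retry(catalog, build_metadata, num_retries=8,
                      min_wait_ms=500, sleep=time.sleep):
    """Retry an optimistic commit, doubling the wait after each conflict."""
    wait_ms = min_wait_ms
    for attempt in range(num_retries + 1):
        base = catalog.location
        # Re-derive the new metadata from the latest committed state.
        new_location = build_metadata(base)
        if catalog.compare_and_swap(base, new_location):
            return new_location
        if attempt < num_retries:
            sleep(wait_ms / 1000)
            wait_ms *= 2
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog("v1.metadata.json")
attempts = []


def build_metadata(base):
    attempts.append(base)
    if len(attempts) == 1:
        # Simulate another writer committing while we were preparing ours.
        catalog.location = "v2.metadata.json"
    return base + ".next"


result = commit_with_retry(catalog, build_metadata, sleep=lambda s: None)
print(result, attempts)
```

The first attempt loses the race and is rebuilt on top of the other writer's commit, so no work is silently overwritten. More writers mean more lost races, which is when higher retry counts pay off.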
4.6 Monitoring metrics
To know whether an Iceberg table is healthy, watch these metrics regularly.
| Metric | Meaning | Red flag |
|---|---|---|
| Average file count per partition | Compaction effect | Over 100 |
| Average file size | Compaction / write distribution | Under 32 MiB |
| Cumulative snapshot count | Expiration policy working | Over 1,000 |
| metadata.json size | Metadata bloat signal | Above 8 MiB |
| Average manifest size and count | Pruning efficiency | Over 5,000 manifests |
| Average commit latency | Catalog / concurrency issue | p95 above 5 s |
| delete file / data file ratio | Need for MoR compaction | Above 5% → compact |
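The red-flag column translates directly into an automated check. The Python sketch below encodes the thresholds from the table; the metric key names are hypothetical and would need to match whatever your collection job emits.

```python
# Thresholds copied from the red-flag column above; key names are illustrative.
RED_FLAGS = {
    "avg_files_per_partition": lambda v: v > 100,
    "avg_file_size_mib": lambda v: v < 32,
    "snapshot_count": lambda v: v > 1000,
    "metadata_json_mib": lambda v: v > 8,
    "manifest_count": lambda v: v > 5000,
    "commit_latency_p95_s": lambda v: v > 5,
    "delete_to_data_file_ratio": lambda v: v > 0.05,
}


def red_flags(metrics):
    """Return the names of metrics that cross their red-flag threshold."""
    return sorted(
        name for name, is_bad in RED_FLAGS.items()
        if name in metrics and is_bad(metrics[name])
    )


metrics = {
    "avg_files_per_partition": 240,     # too many files per partition
    "avg_file_size_mib": 12,            # files too small
    "snapshot_count": 310,              # fine
    "delete_to_data_file_ratio": 0.02,  # fine
}
flagged = red_flags(metrics)
print(flagged)
```

Metrics that were not collected are skipped rather than treated as healthy or unhealthy, so a partial report still produces a usable alert list.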
Periodically extract these values from the catalog and metadata and put them on a dashboard — that's standard operations practice. Iceberg's system tables (db.events.files, db.events.snapshots, db.events.manifests) can be used directly.
-- File statistics
SELECT
partition,
count(*) AS file_count,
avg(file_size_in_bytes) AS avg_size,
sum(file_size_in_bytes) AS total_size
FROM db.events.files
GROUP BY partition
ORDER BY file_count DESC;
-- Snapshot accumulation
SELECT count(*) FROM db.events.snapshots;
4.7 The shape of operational automation
A mature Iceberg operations team needs this set of automated jobs:
- Daily compaction job: rewrite_data_files on yesterday's partitions (only partitions whose small-file count crosses the threshold).
- Daily expiration job: Expire snapshots older than N days; always keep a fixed number.
- Weekly manifest rewrite job: rewrite_manifests.
- Monthly orphan cleanup job: remove_orphan_files (dry run → validate → execute).
- Table-health report job: Extract the monitoring metrics above and emit dashboards and alerts.
All these jobs must be idempotent and safely retryable on failure. In large environments, it is standard practice to extract this automation into a dedicated "table management service" and operate it alongside the catalog.
5. Engine Compatibility
5.1 Core engine support (as of 2026)
| Engine | Read | Write | DML | Time travel | Branch/Tag | V2 (MoR) | V3 | REST Catalog |
|---|---|---|---|---|---|---|---|---|
| Apache Spark | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | In progress | ✓ |
| Trino | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Partial | ✓ |
| Apache Flink | ✓ | ✓ (streaming) | Partial | ✓ | Partial | ✓ | In progress | ✓ |
| Snowflake | ✓ | ✓ | ✓ | ✓ | Limited | ✓ | In progress | ✓ (Polaris) |
| Databricks (Unity) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | In progress | ✓ (UC OSS) |
| BigQuery | ✓ | Partial | Partial | ✓ | Partial | ✓ | In progress | ✓ (BigLake) |
| AWS Athena | ✓ | ✓ | ✓ | ✓ | Partial | ✓ | In progress | ✓ (Glue) |
| ClickHouse | ✓ | Experimental | ✗ | ✓ | ✗ | Partial | ✗ | ✓ |
| DuckDB | ✓ | Experimental | ✗ | ✓ | ✗ | Partial | ✗ | ✓ |
| PyIceberg | ✓ | ✓ | Partial | ✓ | ✓ | ✓ | In progress | ✓ |
The table is a generalized snapshot of typical support as of May 2026; check the latest release notes for each engine and Iceberg's compatibility tables at the time of adoption.
5.2 Recommended role per engine
- Spark: The standard for backfill, large ETL, and table maintenance. The system.* procedures are the most complete.
- Trino: Interactive analytics and the BI back end. Strong on short queries; MoR application is stable.
- Flink — Streaming ingestion. Strong on exactly-once commits and V2 delete writes.
- Snowflake / Databricks — Self-service and BI for in-house users. The shared-table pattern through the catalog.
- BigQuery / Athena — Reporting and ad-hoc analysis. When you only want queries with no infrastructure to operate.
- PyIceberg — Lightweight ETL, ML training pipelines, and validation in notebooks or locally.
5.3 What the REST Catalog means
Since the REST Catalog standard took hold in 2024–2025, "decoupling engine from catalog so they can be combined freely" became real.
The implications are decisive:
- Break engine lock-in. Moving from one engine to another does not require data migration.
- Concentrate permissions, audits, and policies in one place. The catalog becomes the true control plane.
- The cost of adopting a new engine drops. Every engine implementing the REST spec can immediately work against the same tables.
6. Comparison with Other Formats
6.1 Iceberg vs Delta Lake vs Hudi — the core differences
| Item | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Starting point | Multi-engine, metadata-centric | Spark/Databricks-centric, transaction log | Streaming upsert / incremental processing |
| Metadata model | Snapshot + manifest tree | Transaction log (JSON) + checkpoints | Timeline (.hoodie) + metadata |
| Catalog abstraction | First-class (REST spec) | Secondary (Unity is filling the gap) | External dependency |
| Hidden partitioning | ✓ | Limited (generated column) | Partial |
| Partition evolution | ✓ | Limited | Partial |
| Schema evolution | Safe (ID-based) | Safe (name-based) | Safe |
| Row-level delete | V2 delete files / V3 vectors | Deletion vectors | Soft delete + compaction |
| Time travel | ✓ | ✓ | ✓ |
| Branch / tag | ✓ (Git-like) | Limited (time travel only) | Limited |
| Multi-engine maturity | Highest | Databricks-centric, improving externally | Spark/Flink-centric |
| Streaming workloads | Possible and improving | Possible | Most mature |
| Public standard spec | Open and agreed through v3 | Spec public but Databricks-led | Open |
6.2 Decision tree
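As a rough sketch of the decision logic, the Python function below encodes the criteria stated in the key conclusions (§1.2) and the adoption recommendations (§9.2). The function name, the parameter names, the order in which the questions are asked, and the final fallback are assumptions made for this example; treat it as a starting point for discussion rather than a rule.

```python
def recommend_format(engine_count, databricks_only=False, upsert_heavy=False,
                     needs_retention_or_isolation=False):
    """Suggest a table format from a few coarse workload traits.

    needs_retention_or_isolation covers long-term retention, legal
    correction, and experiment isolation (branches, tags, time travel).
    """
    # Upsert-heavy CDC is the one case where Hudi is a serious contender.
    if upsert_heavy:
        return "Compare Iceberg V2/V3 (MoR) against Hudi"
    # Two or more engines, or retention/isolation needs, favor Iceberg.
    if engine_count >= 2 or needs_retention_or_isolation:
        return "Iceberg + REST Catalog"
    # A Databricks-only estate can reasonably stay on Delta Lake.
    if databricks_only:
        return "Delta Lake (UniForm if external engines are being considered)"
    # Assumption: with no strong driver, defer to existing engine support.
    return "No strong driver; decide on existing engine support"


print(recommend_format(engine_count=3))
print(recommend_format(engine_count=1, databricks_only=True))
print(recommend_format(engine_count=1, upsert_heavy=True))
```

The questions are ordered from most to least constraining, so the first one that applies decides the outcome.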
6.3 Delta-Iceberg interop options
Since 2024, both camps have pursued interop, producing the following options.
- Delta UniForm — Generate Iceberg metadata alongside a Delta table so Iceberg readers can read the Delta table. One-way (Delta → Iceberg read).
- Apache XTable (formerly OneTable) — Translates metadata between Iceberg, Delta, and Hudi. Data is shared; only metadata is expressed in each format.
- Delta exposed via Iceberg REST — Unity Catalog OSS is moving toward exposing Delta tables through the Iceberg REST spec.
Interop options are convenient, but "native spec as-is" is always the most stable. Known limitations of interop modes (e.g., gaps in V2 delete support) must be reviewed carefully before adoption.
7. Adoption Strategy and Migration Patterns
7.1 Which workloads to adopt first
Priority recommendations:
- Long-term retention / legal-correction data — Time travel and branch/tag value materialize immediately.
- Core fact tables that need multi-engine sharing — Once standardized, the impact ripples across the entire in-house analytics infrastructure.
- Marts for new domains — Lowest-risk way to accumulate operations experience.
- Existing Hive core tables — The biggest payoff, but also the biggest migration burden. Approach only after building operations know-how on 1–3.
7.2 Hive → Iceberg migration options
Three standard patterns.
(a) migrate — in-place replacement
CALL system.migrate('hive_db.events');
-- Replace the Hive table's metadata with Iceberg metadata.
-- Data files stay put. The fastest option.
- Pros: No data movement, completes in minutes.
- Cons: The old Hive directory structure (partition-key encoding) remains, so you lose some of the hidden-partitioning benefit. Column IDs get assigned to old files; some engines pay extra cost on the first query.
(b) snapshot — shadow table
CALL system.snapshot('hive_db.events', 'iceberg_db.events_v2');
-- Leave the Hive table in place and create an Iceberg table that references the same data files.
-- You can write to both during a validation/comparison window.
- Pros: Safe comparison and rollback during operations.
- Cons: You must maintain both sets of metadata in parallel for a while.
(c) CTAS — full rewrite
CREATE TABLE iceberg_db.events
USING iceberg
PARTITIONED BY (days(event_ts))
TBLPROPERTIES ('write.distribution-mode'='hash')
AS SELECT * FROM hive_db.events;
- Pros: Apply a fresh partition spec, sort order, compression codec, and file-size policy from the start. Cleanest state.
- Cons: Data is rewritten. Petabyte scale costs time and money.
Recommendation: (c) for core tables intended for long-term operation, (b) when short-term comparison matters, (a) when fast adoption is the priority.
7.3 Delta → Iceberg
Options:
- Use UniForm — Keep Delta as is and additionally generate Iceberg metadata. Cheapest when only reads are needed.
- XTable for two-way metadata translation — Data is shared; metadata is exposed in both formats.
- CTAS, full rewrite — Recommended when you are ready to operate natively on Iceberg.
Operational recommendation: Do not move critical workloads off of Delta immediately. Build 6–12 months of Iceberg operations experience on a new domain or a shadow table, then migrate in stages.
7.4 Phase-by-phase adoption checklist
Phase 0 — Pre-assessment (2–4 weeks)
- Inventory in-house engines, catalogs, and storage
- Pick three candidate workloads (per the priority criteria)
- Decide on a catalog (REST / Glue / Unity / Polaris, etc.)
- Plan the operational-automation jobs
Phase 1 — PoC (4–8 weeks)
- Create an Iceberg version of one candidate table via snapshot or CTAS
- Verify identical results with two engines (e.g., Spark + Trino)
- Run compaction, expiration, and orphan cleanup; collect monitoring metrics
- Apply the WAP pattern and at least one time-travel use case in production
Phase 2 — Operational automation (4–8 weeks)
- Standardize and roll out the five automation jobs from §4.7 organization-wide
- Stabilize the catalog, permission, and audit models
- Agree on monitoring dashboards and alert thresholds
- Document operational runbooks (including failure scenarios)
Phase 3 — Expansion (3–6 months)
- Migrate core tables per the priority list
- Connect the in-house data catalog, BI, and ML pipelines to the new catalog
- Decide whether to adopt Iceberg V3 (assess volatility and maturity)
7.5 Common mistakes in migration
- Putting off the catalog decision — The "let's move the data first" approach turns the catalog into an operations bottleneck. Decide on it first.
- Deferring operational automation — Something that worked nicely in PoC stops six months later under metadata explosion. Build the automation alongside the PoC.
- Picking CoW or MoR uniformly — Ignoring per-table workload characteristics and forcing one mode makes CDC tables slow or analytics tables lose their sort. Decide per table.
- No time-travel SLA — Without a "how far back must we be able to restore" policy, expiration becomes overly conservative and storage cost climbs forever.
- Compaction job blowing up shuffle — One job compacting too wide a range can stall the cluster. Slice the range and time finely and cap per-job resources explicitly.
8. Outlook as of 2026
8.1 Spec evolution
- V3 going mainstream — Variant, Geospatial, and Deletion Vectors should reach GA across major engines by late 2026. The largest gains are in CDC and real-time analytics workloads.
- The rise of Row Lineage — Stable row-level IDs directly serve CDC, feature stores, and reproducible ML training. Combined with data-governance and lineage tooling, this will birth new operational patterns.
- Standardization of materialized views — Spec-level agreement on MVs and aggregation caches over Iceberg will likely reshape the cost structure of analytics workloads again.
8.2 Realignment in the catalog camp
- Apache Polaris, Unity OSS, Lakekeeper, Apache Gravitino compete on a shared REST baseline. With a standardized spec, users see the same interface regardless of backend.
- Commercial vs OSS balance — The next three years will hinge on the choice between "use the catalog as SaaS" and "self-host." The depth of permission, audit, and metadata management will drive cost.
8.3 Changes on the engine side
- Integration with AI/ML workloads — Iceberg's branches/tags and time travel are used to reproduce training data and to synchronize model and data versions. From 2026 on, feature stores and MLOps tooling will more frequently handle Iceberg as first-class.
- Broader native support from OLAP engines — Direct Iceberg-write support in ClickHouse, StarRocks, and DuckDB is maturing fast.
8.4 The control point of data governance shifts
The combination of Iceberg + REST Catalog moves the governance control point from "the engine" to "the catalog." Once data masking, row-level filters, and audit logs are decided at the catalog level, the same policy applies regardless of which engine the user comes through. This makes multi-engine compliance consistent — practically for the first time.
9. Conclusions and Recommendations
9.1 Key messages
- Iceberg is not a mere table format; it is a metadata specification that brings database semantics on top of object storage.
- Its value is only partially visible in a single-engine environment, but becomes decisive in multi-engine, long-term, correction, and experiment-isolation workloads.
- As of 2026, with the combination of the REST Catalog standard + V3 spec, Iceberg is effectively the standard format for multi-engine Lakehouses.
- However, adopting Iceberg without operational automation quickly makes metadata operations cancel out the adoption benefit. Compaction, expiration, orphan cleanup, and monitoring are as important as understanding the spec.
9.2 Adoption recommendations
| Scenario | Recommendation |
|---|---|
| Building a new multi-engine data platform | Adopt Iceberg + REST Catalog as the standard from day one |
| Existing Databricks-only environment considering external engines | Delta UniForm or gradual Iceberg adoption |
| Migrating a Hive-based legacy | Phased adoption starting with priority tables; CTAS recommended |
| Real-time CDC / upsert-heavy | Compare Iceberg V2/V3 (MoR) against Hudi |
| ML / experiment reproducibility matters | Use Iceberg branches, tags, and time travel |
9.3 How Data Dynamics can help
Data Dynamics, the author of this whitepaper, supports Iceberg adoption across the following areas.
- Lakehouse architecture design and review — Recommendations on catalog, engine, and storage topology
- Migration execution — Staged migration from Hive and Delta, from PoC through operational automation
- Operational-automation standardization — Idempotent job design for compaction, expiration, and cleanup; monitoring dashboards
- Catalog operations — Selecting and operating REST Catalog backends (Polaris, Unity OSS, Lakekeeper)
- Engine integration — Consistency validation across Spark, Trino, Flink, Snowflake, Databricks, and BigQuery
If you need a pre-adoption assessment or a technical workshop, contact us and we'll put together an adoption roadmap tailored to your environment.
10. References
- Apache Iceberg official spec — iceberg.apache.org/spec
- Apache Iceberg documentation — iceberg.apache.org/docs
- Iceberg REST Catalog spec: open-api/rest-catalog-open-api.yaml in the Apache Iceberg repository
- Apache Polaris: polaris.apache.org
- Apache XTable — xtable.apache.org
- PyIceberg — py.iceberg.apache.org
- Delta Lake spec — delta.io/protocol
- Apache Hudi docs — hudi.apache.org
- Netflix Tech Blog — Iceberg adoption case studies
- AWS, Snowflake, and Databricks integration guides for Iceberg
This whitepaper was written based on information as of May 2026. Iceberg's spec and engine compatibility evolve rapidly; check the latest release notes and compatibility tables at the time of adoption.