
Apache Iceberg Whitepaper — Structure and Adoption Strategy for the Next-Generation Lakehouse Table Format

A whitepaper for data architects and platform leaders covering Apache Iceberg's metadata structure, operating model, multi-engine compatibility, and adoption strategy. Spans comparisons with Hive, Delta Lake, and Hudi; migration patterns; operational automation; and the 2026 outlook.

Data Dynamics · May 20, 2026 · 30 min read

Scope of this whitepaper

This paper covers Apache Iceberg from two angles: "why adopt it" and "how to operate it." Aimed at data architects, platform leads, and CDOs choosing and adopting table formats in multi-engine Lakehouse environments, it goes beyond the spec to cover operational automation, catalog topology, and migration patterns from a practitioner perspective.


1. Executive Summary

1.1 One-line summary

Apache Iceberg is an open table format that brings database-level transactions and evolution semantics on top of object storage, and as of 2026 it has become the de facto standard for multi-engine Lakehouses.

1.2 Key conclusions

  • As the Lakehouse has moved from "an ACID layer for a single engine" to "shared tables for many engines," the value of Iceberg's engine-neutral specification and the REST Catalog standard has come into sharp focus.
  • Iceberg is not a "storage format" but a "metadata structure." It is a specification that adds snapshot-based isolation, hidden partitioning, schema/partition evolution, time travel, and branches/tags on top of file formats like Parquet, ORC, and Avro.
  • In a single-engine environment (e.g., Databricks-only), Delta Lake is still a reasonable choice, but for workloads with two or more engines in play, or where long-term retention, legal correction, and experiment isolation matter, Iceberg has a clear advantage.
  • Operating costs are not negligible. Without automating compaction, snapshot expiration, orphan file cleanup, and catalog operations, Iceberg quickly turns into a "metadata swamp." This paper presents the standard operational-automation patterns alongside the spec.
  • With Iceberg V3 and the REST Catalog standard consolidating, the catalog layer will become the true control plane of the Lakehouse over the next three years.

1.3 Who benefits, and from which decisions

| Reader | What this paper provides |
|---|---|
| Data platform architects | Metadata layering, catalog topology, engine compatibility matrix |
| Data engineering leads | Operational automation patterns, compaction and cleanup standards, monitoring metrics |
| CDO / data executives | Adoption decision tree, migration risk and timeline estimates, cost structure |
| ML / analytics leads | Experimentation and reproducibility patterns using time travel and branches/tags |

2. Background: Why Yet Another Table Format

2.1 Structural limits of the Hive era

Through the mid-2010s, the de facto standard across the Hadoop and Spark ecosystem was Hive Metastore (HMS) + directory-based partitioning. It was simple, but its structural limits accumulated as scale grew.

  • Directory = partition = query predicate. Partition pruning only worked when a predicate like WHERE event_date = '2026-05-20' matched the directory path (/event_date=2026-05-20/) exactly. Once a column was renamed or a user wrote an expression like WHERE ts > ..., pruning collapsed.
  • No partition evolution. A decision like "switch from daily to hourly partitions" effectively meant rewriting the table. For a petabyte-scale table in production, that was nearly impossible.
  • Schema evolution by external convention. Hive supported column add/rename to some degree, but position-based mapping without column IDs made column reordering and rename risky.
  • No atomicity. Directory rename was used to fake "atomic publish," but on object stores like S3, rename is implemented as non-atomic copying — yielding partial visibility and duplication.
  • No read-write concurrency. Reading the same partition while another job wrote to it would expose partial files or inconsistent ListBucket results.

2.2 The cloud + object storage shift was the breaking point

S3, ADLS, and GCS differed from HDFS in two decisive ways.

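  1. List is expensive and, at the time, was only eventually consistent (S3 has offered strong read-after-write consistency since December 2020). Hive's layout issued tens of thousands of S3 LIST calls to identify files, and a listing takes minutes once a prefix carries hundreds of thousands of keys.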
  2. Rename is effectively copy + delete. HDFS directory rename was a metadata operation; in S3 it is per-object copying. A job failure mid-run leaves a "half-written" state visible.

In short, Hive was designed on the premise that "the filesystem provides strong consistency and fast renames," and the cloud broke that premise.

2.3 New formats arrive — Iceberg, Delta Lake, Hudi

Three formats arrived nearly simultaneously in this era.

| Format | Started by | Design starting point |
|---|---|---|
| Apache Iceberg | 2017, Netflix | "You must be able to know which files belong to a table without listing directories." |
| Delta Lake | 2017, Databricks | "Extract Spark's transaction log so we can layer ACID on top." |
| Apache Hudi | 2016, Uber | "Incremental processing optimized for streaming/upsert workloads." |

All three solve the problem with a metadata layer, but the structure of that metadata and the operating model differ — which in turn drives the differences in which workloads each format handles best.

2.4 How Iceberg's reception evolved

  • 2018–2020: "Netflix's internal project." Interesting use cases, limited adoption.
  • 2020–2022: Graduated to an Apache Top-Level Project (2020). AWS Athena/EMR, Snowflake, Trino, and Flink added support in turn.
  • 2023–2024: The REST Catalog spec v1 was agreed; Snowflake Polaris and Databricks Unity began handling Iceberg directly, cementing the role of "a shared format across engines."
  • 2025–2026: Iceberg V3 (Variant type, Geospatial, default values, deletion vectors) lands, the spec grows richer, and the REST Catalog becomes the de facto catalog standard.

3. Iceberg Architecture in Depth

3.1 The three-layer model

To guarantee the same answer regardless of which engine reads the table, Iceberg separates table state into three layers.

Figure: Iceberg's three-layer model. The catalog holds the metadata.json pointer, the metadata layer owns the snapshot/manifest tree, and the data layer holds the actual Parquet/ORC files.

The consequences of this separation are clear:

  • Reading = no directory listing. You always walk the metadata.json → manifest list → manifest → data file tree. You are free from list cost and consistency problems.
  • Writing = create a new metadata.json and atomically swap the pointer in the catalog. Old data files and the old metadata.json remain in place, so other readers don't break.
  • The data files themselves carry no notion of "what shape the table should be." Column names, types, partitions — all of it lives in the metadata layer. That's why schema and partition evolution happen without data rewrite.
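The write path described above boils down to a compare-and-swap on the catalog pointer. The following is a toy in-memory sketch of that idea, not the Iceberg API: `ToyCatalog`, `swap`, `commit`, and the metadata file names are all hypothetical, and a real writer would also rebuild its new metadata on top of the latest base before retrying.

```python
import threading


class ToyCatalog:
    """In-memory stand-in for a catalog: one metadata pointer per table (hypothetical)."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        """Atomically move the pointer only if nobody else moved it first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit(catalog, table, new_metadata, max_attempts=3):
    """Optimistic commit: re-read the pointer and retry when the swap loses a race."""
    for _ in range(max_attempts):
        base = catalog.current(table)
        if catalog.swap(table, base, new_metadata):
            return True
    return False


catalog = ToyCatalog()
commit(catalog, "db.events", "v1.metadata.json")

# A writer that still believes the table is empty loses the race...
stale_ok = catalog.swap("db.events", None, "v2-stale.metadata.json")
# ...while a writer that saw v1 wins it.
fresh_ok = catalog.swap("db.events", "v1.metadata.json", "v2.metadata.json")
```

Readers that resolved `v1.metadata.json` before the swap keep reading a consistent table, because nothing under the old pointer was modified.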

3.2 What's actually inside metadata.json

It has roughly the following structure (V2-based, simplified).

{
  "format-version": 2,
  "table-uuid": "5f8a...e9",
  "location": "s3://bucket/warehouse/db/events",
  "last-updated-ms": 1747804800000,
  "last-column-id": 12,
  "schemas": [
    {
      "schema-id": 0,
      "fields": [
        { "id": 1, "name": "event_id",   "required": true,  "type": "long" },
        { "id": 2, "name": "user_id",    "required": false, "type": "long" },
        { "id": 3, "name": "event_ts",   "required": true,  "type": "timestamptz" },
        { "id": 4, "name": "event_type", "required": true,  "type": "string" }
      ]
    }
  ],
  "current-schema-id": 0,
  "partition-specs": [
    {
      "spec-id": 0,
      "fields": [
        { "name": "event_day", "source-id": 3, "transform": "day", "field-id": 1000 }
      ]
    }
  ],
  "default-spec-id": 0,
  "sort-orders": [ ... ],
  "current-snapshot-id": 8123412345678901234,
  "snapshots": [
    {
      "snapshot-id": 8123412345678901234,
      "timestamp-ms": 1747804790000,
      "summary": {
        "operation": "append",
        "added-data-files": "3",
        "added-records": "1450000"
      },
      "manifest-list": "s3://.../snap-8123-1-abc.avro",
      "schema-id": 0
    }
  ],
  "refs": {
    "main":  { "snapshot-id": 8123..., "type": "branch" },
    "wap-2026-05-20": { "snapshot-id": 8123..., "type": "tag" }
  }
}

Key observations:

  • The id inside fields is the real identifier of a column. The name can change while the ID stays the same.
  • partition-specs is an array. That is, a table can carry different partition specs over its lifetime.
  • snapshots are cumulative. All of them are kept (until expired) for time travel.
  • refs are Git-like branch/tag references. Beyond main, user-defined branches and tags are treated as first-class.
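To make these observations concrete, here is a minimal sketch that resolves the current schema and a named ref from a metadata document. The document below is a trimmed, made-up example with the same shape as the one above (short snapshot IDs, a second schema that renames field 2), and the helper names are illustrative only.

```python
import json

# A trimmed, made-up metadata document; field 2 was renamed between schema 0 and 1.
METADATA = json.loads("""
{
  "current-schema-id": 1,
  "schemas": [
    {"schema-id": 0, "fields": [{"id": 1, "name": "event_id"}, {"id": 2, "name": "user_id"}]},
    {"schema-id": 1, "fields": [{"id": 1, "name": "event_id"}, {"id": 2, "name": "account_id"}]}
  ],
  "current-snapshot-id": 20,
  "snapshots": [
    {"snapshot-id": 10, "manifest-list": "snap-10.avro"},
    {"snapshot-id": 20, "manifest-list": "snap-20.avro"}
  ],
  "refs": {
    "main": {"snapshot-id": 20, "type": "branch"},
    "q1-close": {"snapshot-id": 10, "type": "tag"}
  }
}
""")


def current_schema(meta):
    """Return the schema whose schema-id matches current-schema-id."""
    return next(s for s in meta["schemas"] if s["schema-id"] == meta["current-schema-id"])


def manifest_list_for(meta, ref="main"):
    """Resolve a branch or tag name to the manifest list of its snapshot."""
    snapshot_id = meta["refs"][ref]["snapshot-id"]
    return next(s["manifest-list"] for s in meta["snapshots"] if s["snapshot-id"] == snapshot_id)


# The ID is the stable key; the name is just the current label for it.
names_by_id = {f["id"]: f["name"] for f in current_schema(METADATA)["fields"]}
```

Reading the tag `q1-close` instead of `main` is the same lookup with a different ref name, which is all that time travel by tag amounts to at this level.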

3.3 The division of labor between manifest list and manifest

To prune files quickly at query time, data file information is summarized in two stages.

Figure: The roles of manifest list and manifest. The manifest list holds per-manifest summaries, while each manifest carries statistics for individual data and delete files.

Query pruning flow (WHERE event_day = '2026-05-20' AND user_id = 42)

  1. Get the current metadata.json location from the catalog.
  2. metadata.json → read the manifest list of the current snapshot.
  3. For each row of the manifest list (= one manifest file), first-pass pruning using the summarized partition ranges. Skip manifests that don't cover event_day = '2026-05-20'.
  4. Read only the surviving manifests. Use each data file's user_id lower/upper bound for second-pass pruning.
  5. Open only the remaining data files.

What this gives you:

  • Even tables with hundreds of thousands of data files only need to examine tens to hundreds per query.
  • No directory listing required, so you are not dependent on S3 list cost or consistency.
  • Statistics (lower/upper/null) are condensed into the manifests, so you don't need to open every Parquet footer.
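The two-pass pruning flow above can be sketched in a few lines. This is a simplified model under stated assumptions: the `Manifest` and `DataFile` classes are hypothetical, only one partition column and one statistics column are modeled, and delete files are ignored.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    user_id_min: int
    user_id_max: int


@dataclass
class Manifest:
    path: str
    day_min: str  # ISO dates compare correctly as strings
    day_max: str
    files: list


def plan_files(manifest_list, day, user_id):
    """Return the data files that may contain rows for (day, user_id)."""
    selected = []
    for manifest in manifest_list:
        # First pass: skip whole manifests using the partition range summary.
        if not (manifest.day_min <= day <= manifest.day_max):
            continue
        # Second pass: skip data files using per-column lower/upper bounds.
        for f in manifest.files:
            if f.user_id_min <= user_id <= f.user_id_max:
                selected.append(f.path)
    return selected


manifest_list = [
    Manifest("m1.avro", "2026-05-18", "2026-05-19", [DataFile("a.parquet", 1, 100)]),
    Manifest("m2.avro", "2026-05-20", "2026-05-20", [
        DataFile("b.parquet", 1, 40),
        DataFile("c.parquet", 41, 90),
    ]),
]

to_read = plan_files(manifest_list, day="2026-05-20", user_id=42)
```

Only `c.parquet` survives here: `m1.avro` is never opened, and `b.parquet` is ruled out by its upper bound without touching its Parquet footer.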

3.4 Hidden partitioning

One of the biggest pain points of the Hive era was "the partition expression and the query predicate must match exactly." Iceberg solves this by registering a transform on the metadata.

-- DDL: partition event_ts by day (data only stores event_ts)
CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP,
  event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_ts));
 
-- Query: you don't need to know the partition column
SELECT count(*)
FROM events
WHERE event_ts >= TIMESTAMP '2026-05-20 00:00:00'
  AND event_ts <  TIMESTAMP '2026-05-21 00:00:00';
  • Only event_ts is stored in the data; the partition key event_day only exists in the metadata.
  • Even when the predicate is on event_ts, the engine understands the partition spec's transform (days(event_ts)) and performs partition pruning automatically.
  • The user does not have to write a condition like event_day = .... The "partition key" is hidden from the user.

Supported transforms: identity, bucket(N, col), truncate(W, col), year, month, day, hour, void.
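As an illustration of what a transform computes, the sketch below implements the time-based and integer truncate transforms and shows how a range predicate on event_ts maps to partition values. The function names are hypothetical; `bucket` is left out because the spec defines it with a 32-bit Murmur3 hash that is not reproduced here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """day(ts): whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def hour_transform(ts):
    """hour(ts): whole hours since 1970-01-01 00:00 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)


def truncate_int(width, value):
    """truncate(W, col) for integers: round down to a multiple of W."""
    return value - (value % width)


def days_for_range(start, end_exclusive):
    """Day partition values that a half-open [start, end) predicate can touch."""
    first = day_transform(start)
    # Step back one microsecond so an end bound at midnight stays in the previous day.
    last = day_transform(end_exclusive - timedelta(microseconds=1))
    return list(range(first, last + 1))


start = datetime(2026, 5, 20, tzinfo=timezone.utc)
end = datetime(2026, 5, 21, tzinfo=timezone.utc)
partitions = days_for_range(start, end)
```

The query in the example above therefore touches exactly one day partition, even though the user only ever wrote a predicate on event_ts.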

3.5 Schema evolution

Because Iceberg assigns columns permanent IDs, the following changes happen without rewriting data:

| Operation | Safety | Notes |
|---|---|---|
| Add column | Safe | Existing rows are NULL or default |
| Drop column | Safe | The ID disappears from metadata only; data files are untouched |
| Rename column | Safe | ID is preserved; metadata.json updates only the name |
| Reorder column | Safe | Only the field order in metadata changes |
| Type widening | Partially safe | int → long, float → double, decimal precision increases, etc. |
| Type narrowing | Not allowed | Lossy conversions like long → int are forbidden |
| nullable → required | Conditional | Must verify that every existing row is non-null |

What makes this possible is that Iceberg resolves columns by field ID rather than by name or position. Column IDs are written into each data file's own schema, so even after a rename in metadata the reader still finds the correct column.
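A minimal sketch of ID-based resolution, assuming a toy in-memory "data file" keyed by field ID (the structures and the `project` helper are hypothetical, not an Iceberg reader API):

```python
def project(file_columns, table_fields):
    """Build output rows by field ID; columns missing from the file become None."""
    n_rows = len(next(iter(file_columns.values()))["values"])
    rows = []
    for i in range(n_rows):
        row = {}
        for field in table_fields:
            column = file_columns.get(field["id"])
            row[field["name"]] = column["values"][i] if column else None
        rows.append(row)
    return rows


# Written before the rename: the file still says "user_id" for field 2.
old_file = {
    1: {"name": "event_id", "values": [10, 11]},
    2: {"name": "user_id", "values": [7, 8]},
}

# Current table schema: field 2 was renamed and field 5 was added later.
schema = [
    {"id": 1, "name": "event_id"},
    {"id": 2, "name": "account_id"},
    {"id": 5, "name": "country"},
]

rows = project(old_file, schema)
```

The old file is read under the new name `account_id` without being rewritten, and the added column simply comes back as NULL for rows written before it existed.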

3.6 Partition evolution

Iceberg allows partition specs to be added over time. Patterns like "start daily, switch to hourly once traffic grows" are operationally feasible.

-- Initially: daily partitions, set at creation time in 3.4
-- with PARTITIONED BY (days(event_ts))
-- partition spec id = 0 : (day(event_ts))
 
-- Switch to hourly mid-flight (requires the Iceberg Spark SQL extensions)
ALTER TABLE events
  REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);
-- partition spec id = 1 : (hour(event_ts))

Operational notes:

  • Historical data remains under the old partition spec, and only new writes use the new spec.
  • Two partition specs therefore coexist within one table. Queries work correctly against both, but pruning efficiency differs by spec.
  • If needed, rewrite_data_files can rewrite old data to the new spec (with data-movement cost).
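The coexistence of two specs can be sketched as follows: each file records which spec it was written under, and the planner evaluates the same predicate against each spec's transform. This is an illustrative model only; the dictionaries, `spec_id` keys, and file names are made up.

```python
HOURS_PER_DAY = 24


def matches(file, hour_lo, hour_hi):
    """True when the file's partition can overlap the inclusive hour range."""
    if file["spec_id"] == 0:   # day(event_ts): value is days since epoch
        return hour_lo // HOURS_PER_DAY <= file["partition"] <= hour_hi // HOURS_PER_DAY
    if file["spec_id"] == 1:   # hour(event_ts): value is hours since epoch
        return hour_lo <= file["partition"] <= hour_hi
    return True                # unknown spec: keep the file to stay correct


files = [
    {"path": "old-day-20592.parquet", "spec_id": 0, "partition": 20592},
    {"path": "old-day-20593.parquet", "spec_id": 0, "partition": 20593},
    {"path": "new-hour-a.parquet", "spec_id": 1, "partition": 20593 * 24 + 9},
    {"path": "new-hour-b.parquet", "spec_id": 1, "partition": 20593 * 24 + 15},
]

# Predicate: 09:00 through 10:59 on day 20593 (2026-05-20), as inclusive hour ordinals.
lo, hi = 20593 * 24 + 9, 20593 * 24 + 10
survivors = [f["path"] for f in files if matches(f, lo, hi)]
```

The result is correct under both specs, but the daily file is kept for the whole day while the hourly files are pruned to the hour, which is the pruning-efficiency difference noted above.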

3.7 V1 vs V2 vs V3 — spec evolution

| Item | V1 | V2 | V3 (2025+) |
|---|---|---|---|
| Standardized | 2018+ | 2021+ | 2025+ |
| Row-level delete | Impossible (CoW only) | Position / Equality delete file | Deletion vectors (Puffin) |
| Sequence number | None | Yes (essential for consistency) | Retained |
| Column default value | No | No | Yes |
| Variant type | No | No | Yes (semi-structured) |
| Geospatial type | No | No | Yes |
| Row Lineage | No | No | Yes (CDC-friendly) |

What V2 brought — row-level deletes:

  • In V1, deleting one row meant rewriting the whole file it lived in (Copy-on-Write, CoW).
  • V2 introduced two kinds of delete file that allow expressing the change as "old file + delete marker" (Merge-on-Read, MoR).
    • Position delete: (file path, row position). Best suited to CDC and MERGE workloads.
    • Equality delete: (column = value). Best for key-based deletes.
  • At read time the engine applies the deletes in memory. Without frequent compaction, reads slow down.
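What "applies the deletes in memory" means can be sketched as below. This is a simplified model: the `apply_deletes` helper and the row and delete shapes are hypothetical, and it ignores the sequence numbers that the spec uses to decide which delete files apply to which data files.

```python
def apply_deletes(path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of delete."""
    # Position deletes name a (file path, row position) pair.
    dead_positions = {pos for p, pos in position_deletes if p == path}
    live = []
    for pos, row in enumerate(rows):
        if pos in dead_positions:
            continue
        # Equality deletes name column values; any matching row is dropped.
        if any(all(row.get(k) == v for k, v in d.items()) for d in equality_deletes):
            continue
        live.append(row)
    return live


rows = [
    {"event_id": 1, "user_id": 7},
    {"event_id": 2, "user_id": 8},
    {"event_id": 3, "user_id": 9},
]
position_deletes = [("f1.parquet", 0), ("f2.parquet", 1)]
equality_deletes = [{"user_id": 9}]

live = apply_deletes("f1.parquet", rows, position_deletes, equality_deletes)
```

Every reader pays this filtering cost on every scan until compaction folds the deletes into rewritten data files, which is why delete-file accumulation shows up as read latency.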

What V3 brings:

  • Deletion Vectors (Puffin format) — A more efficient representation of V2 position deletes. A Roaring bitmap reduces memory and disk usage.
  • Variant type — Stores semi-structured data like JSON with a consistent encoding; engines interpret it the same way.
  • Row Lineage — Assigns a stable ID per row, useful for CDC and ML reproducibility.

3.8 Copy-on-Write vs Merge-on-Read

After V2 the most important operational decision is whether to use CoW or MoR per table.

| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
|---|---|---|
| UPDATE/DELETE behavior | Rewrite affected files | Keep old files + add delete files |
| Write cost | High (large file rewrite) | Low (only small delete files) |
| Read cost | Low (no delete to apply) | High (deletes applied in memory) |
| Compaction/sort state | Maintained immediately | Degrades over time, needs compaction |
| Best for | Analytics-heavy, infrequent corrections | CDC, GDPR corrections, frequent upserts |

Operational recommendations:

  • Analytics-heavy tables (e.g., daily aggregates, star-schema facts) — CoW recommended.
  • CDC/MERGE patterns (e.g., real-time user-state tables) — MoR + periodic compaction recommended.
  • Set the mode explicitly with table properties:
    ALTER TABLE events SET TBLPROPERTIES (
      'write.delete.mode'='merge-on-read',
      'write.update.mode'='merge-on-read',
      'write.merge.mode' ='merge-on-read'
    );
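To put rough numbers on the write-cost difference between the two modes, here is a back-of-the-envelope sketch. The cost model and every input value are illustrative assumptions, not measurements: it only counts bytes written, and the per-entry sizes are made up.

```python
def cow_bytes_written(files_touched, avg_file_bytes):
    """CoW rewrites every data file that contains at least one changed row."""
    return files_touched * avg_file_bytes


def mor_bytes_written(rows_changed, delete_entry_bytes, new_row_bytes):
    """MoR writes one delete entry per changed row plus the new row versions."""
    return rows_changed * (delete_entry_bytes + new_row_bytes)


MiB = 1024 * 1024

# Illustrative inputs: 1,000 updated rows scattered across 200 files of 512 MiB each.
cow = cow_bytes_written(files_touched=200, avg_file_bytes=512 * MiB)
mor = mor_bytes_written(rows_changed=1_000, delete_entry_bytes=64, new_row_bytes=256)
ratio = cow / mor
```

With scattered updates the gap is several orders of magnitude, which is why CDC-style tables lean toward MoR; when updates cluster in a few files or arrive rarely, the gap narrows and CoW's cheaper reads win.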

3.9 Time travel, branches, tags

Iceberg's snapshot model naturally extends to Git-like data version control.

-- Time travel (specific timestamp)
SELECT * FROM events FOR SYSTEM_TIME AS OF '2026-05-20 09:00:00';
 
-- By snapshot ID directly
SELECT * FROM events VERSION AS OF 8123412345678901234;
 
-- Create a branch (Write-Audit-Publish pattern)
ALTER TABLE events CREATE BRANCH `wap-2026-05-20`;
-- Write changes onto the branch and verify (the branch_ prefix selects the branch)
INSERT INTO events.`branch_wap-2026-05-20` SELECT ...;
-- On success, fast-forward main to the audited branch
CALL system.fast_forward(table => 'events', branch => 'main', to => 'wap-2026-05-20');
 
-- Tag (long-lived retention point, kept for 365 days)
ALTER TABLE events CREATE TAG `q1-2026-close`
  AS OF VERSION 8123412345678901234
  RETAIN 365 DAYS;

Patterns:

  • Write-Audit-Publish (WAP) — Write new data to a branch first, merge to main only after quality checks (dbt test, Great Expectations) pass. If validation fails, simply discard the branch — preventing the risk of "even briefly exposing bad data on main."
  • ML experiment isolation — Pin training snapshots with tags (e.g., model-v3-train). Six months later you can retrain on identical data.
  • Legal correction and audit — Tag the pre-correction state so auditors can see "before and after" the correction.
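The Write-Audit-Publish flow reduces to moving named references over a snapshot graph. The sketch below models that with plain dictionaries; `fast_forward`, `is_ancestor`, and `audit_passed` are hypothetical helpers written for illustration, not Iceberg calls.

```python
def is_ancestor(parents, ancestor, descendant):
    """Walk parent links from descendant back toward the root."""
    current = descendant
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


def fast_forward(refs, parents, branch, to):
    """Move `branch` to the snapshot of `to`, only if no history would be lost."""
    if not is_ancestor(parents, refs[branch], refs[to]):
        raise ValueError(f"{branch} has commits that {to} does not contain")
    refs[branch] = refs[to]


# snapshot id -> parent snapshot id
parents = {1: None, 2: 1, 3: 2}
refs = {"main": 1, "wap-2026-05-20": 1}

refs["wap-2026-05-20"] = 3      # two commits landed on the audit branch


def audit_passed():
    return True                  # placeholder for dbt tests or similar checks


if audit_passed():
    fast_forward(refs, parents, "main", "wap-2026-05-20")
else:
    del refs["wap-2026-05-20"]   # discard the branch; main never saw the data
```

Because publishing is only a pointer move, readers of `main` see either the old snapshot or the fully audited one, never an intermediate state.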

4. Iceberg from an Operations Perspective

4.1 The standard maintenance job set

Operating Iceberg means automating these four maintenance jobs.

| Job | What it does | Recommended frequency |
|---|---|---|
| rewrite_data_files | Merges small files into large ones; applies a sort order | Daily–weekly |
| rewrite_manifests | Reorganizes manifests to restore pruning efficiency | Weekly–monthly |
| expire_snapshots | Removes old snapshots and the files that only those snapshots reference | Daily |
| remove_orphan_files | Removes data/metadata files not referenced by any metadata | Weekly–monthly |

What happens if you skip them:

  • Small file explosion: a streaming job committing every 10 seconds creates at least 8,640 files/day (86,400 s ÷ 10 s), or about 260,000 after 30 days. Each query must then read thousands of manifests.
  • Metadata explosion — Tens of thousands of accumulated snapshots make metadata.json grow into tens of MBs, and every write rewrites it in full — commits slow down.
  • Storage cost explosion — Without expiration, a table that updated one row a hundred million times ends up storing tens of times the source volume.
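The small-file arithmetic above is easy to reproduce for your own commit interval. A minimal sketch, assuming one file per commit unless stated otherwise (the function name and parameters are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def files_per_day(commit_interval_s, files_per_commit=1):
    """Lower bound on new data files per day for a steadily committing writer."""
    return SECONDS_PER_DAY // commit_interval_s * files_per_commit


daily = files_per_day(10)
monthly = daily * 30
```

A writer that fans out to several partitions per commit multiplies these numbers by `files_per_commit`, so the real count is usually higher than this lower bound.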

4.2 Compaction design

-- Spark SQL: basic compaction
CALL system.rewrite_data_files(
  table => 'db.events',
  options => map(
    'min-input-files',     '5',
    'target-file-size-bytes','536870912', -- 512 MiB
    'rewrite-all',         'false'
  )
);
 
-- Add a sort order
CALL system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'event_ts ASC, user_id ASC'
);

Design principles:

  • Keep target-file-size-bytes between 256 and 1024 MiB. Too small and you pay per-file open and planning overhead; too large and shuffle/memory pressure grows.
  • Sort by the columns most often used for pruning — maximizes lower/upper bound efficiency.
  • MoR tables also apply delete files in the same pass during compaction, so periodic compaction is the key to maintaining query performance.
  • Job-size control — Don't rewrite every file at once. Compact incrementally by partition or time range. Use the where option to scope.
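To show how a target file size and a minimum input count shape a compaction job, here is a greedy bin-packing sketch. It is loosely modeled on the options above and is not Iceberg's actual rewrite planner; `plan_groups` and its parameters are hypothetical.

```python
def plan_groups(file_sizes, target_bytes, min_input_files=5):
    """Greedily pack files smaller than the target into rewrite groups."""
    small = sorted(s for s in file_sizes if s < target_bytes)
    groups, current, current_bytes = [], [], 0
    for size in small:
        # Close the group once adding this file would overshoot the target.
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    # Rewriting only a handful of files is rarely worth a job.
    return [g for g in groups if len(g) >= min_input_files]


MiB = 1024 * 1024
sizes = [8 * MiB] * 70 + [600 * MiB]   # 70 small files plus one already-large file
groups = plan_groups(sizes, target_bytes=512 * MiB)
group_sizes = [len(g) for g in groups]
```

Running the planner per partition or per time slice, rather than over the whole table, is what keeps each job's shuffle bounded.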

4.3 Snapshot expiration and cleanup

-- Expire snapshots older than 7 days, keep at least 5
CALL system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2026-05-13 00:00:00',
  retain_last => 5
);
 
-- Remove orphan files not referenced by any snapshot (older than 3 days)
CALL system.remove_orphan_files(
  table => 'db.events',
  older_than => TIMESTAMP '2026-05-17 00:00:00'
);

Operational recommendations:

  • Be conservative with older_than — Long-running jobs (e.g., 6-hour backfills) might still reference old snapshots.
  • remove_orphan_files requires care — A wrong invocation can delete files another job just wrote. Validate with dry_run => true first.
  • Define a time-travel SLA — A policy like "we will not restore data older than 30 days" makes expiration thresholds clear.
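The interplay between older_than, retain_last, and a time-travel SLA can be sketched as below. The helpers are hypothetical and deliberately simplified: real expiration also keeps any snapshot still referenced by a branch or tag.

```python
from datetime import datetime, timedelta


def cutoff(now, time_travel_sla, longest_job):
    """Keep what the SLA promises, plus headroom for the longest-running reader."""
    return now - (time_travel_sla + longest_job)


def snapshots_to_expire(snapshots, older_than, retain_last):
    """snapshots: list of (snapshot_id, committed_at) tuples, in any order."""
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:retain_last]}
    return [sid for sid, ts in newest_first if ts < older_than and sid not in protected]


now = datetime(2026, 5, 20)
older_than = cutoff(now, time_travel_sla=timedelta(days=7), longest_job=timedelta(hours=6))

# Ten daily snapshots: id 1 is one day old, id 10 is ten days old.
snapshots = [(i, now - timedelta(days=i)) for i in range(1, 11)]
expired = snapshots_to_expire(snapshots, older_than, retain_last=5)
```

Deriving the cutoff from an explicit SLA plus the longest expected job keeps the policy auditable and avoids expiring a snapshot that a six-hour backfill is still reading.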

4.4 Catalog selection

Iceberg abstracts catalogs, but the catalog you actually pick drives the operating model.

| Catalog | Best for | Limitations |
|---|---|---|
| Hive Metastore (HMS) | Gradual adoption on top of existing Hive assets | Weak consistency and permission model in multi-engine settings |
| AWS Glue | Single-cloud AWS with Athena/EMR/Redshift integration | Awkward to use outside AWS |
| REST Catalog | Multi-engine, multi-cloud on a standardized spec | Self-hosting / operations burden; need to pick a backend implementation |
| Project Nessie | Git-like data versioning (branch/merge) | Limited permissioning and SaaS options |
| Snowflake Polaris | Multi-engine sharing in Snowflake-centric environments | Some Snowflake coupling remains |
| Databricks Unity Catalog | Databricks-centric environments where UC handles Iceberg as first-class | Engines outside UC need separate REST adapters |

Recommended pattern: For a new multi-engine environment, put a backend that implements the REST Catalog spec (Apache Polaris, Tabular OSS, Lakekeeper, Apache Gravitino, Unity Catalog OSS) in front, and let Spark, Trino, Flink, BigQuery, and Snowflake all access tables through it.

Figure: REST Catalog topology. Spark, Trino, Flink, Snowflake, and BigQuery all access the same tables on the same object storage via a single REST catalog.

4.5 Write modes and distribution tuning

ALTER TABLE events SET TBLPROPERTIES (
  'write.distribution-mode'  = 'hash',          -- 'none' | 'hash' | 'range'
  'write.target-file-size-bytes' = '536870912', -- 512 MiB
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.row-group-size-bytes' = '134217728',
  'commit.retry.num-retries' = '8',
  'commit.retry.min-wait-ms' = '500'
);

Key parameters:

  • write.distribution-mode
    • none — Input distribution preserved. Lots of small files appear.
    • hash — Hash on the partition columns. Generally recommended. File count per partition becomes uniform.
    • range — Sorted distribution. Useful for time-ordered log workloads.
  • commit.retry.* — Retry policy on optimistic concurrency control failures with many writers. Raise the values when conflicts are frequent.
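When tuning commit.retry.*, it helps to see how long a writer may wait in the worst case. The sketch below assumes a doubling backoff capped at a maximum wait, with no jitter; `max_wait_ms` stands in for the commit.retry.max-wait-ms property, and the 60-second default used here is an assumption to check against your Iceberg version.

```python
def backoff_schedule(num_retries, min_wait_ms, max_wait_ms=60_000):
    """Wait before each retry: doubles from min_wait_ms, capped at max_wait_ms."""
    return [min(min_wait_ms * 2 ** attempt, max_wait_ms) for attempt in range(num_retries)]


# Values from the table properties shown above.
schedule = backoff_schedule(num_retries=8, min_wait_ms=500)
worst_case_wait_ms = sum(schedule)
```

Under these assumptions, eight retries starting at 500 ms can add roughly two minutes of waiting to a single commit, so raising the retry count is a trade against commit latency rather than a free fix.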

4.6 Monitoring metrics

To know whether an Iceberg table is healthy, watch these metrics regularly.

| Metric | Meaning | Red flag |
|---|---|---|
| Average file count per partition | Compaction effect | Over 100 |
| Average file size | Compaction / write distribution | Under 32 MiB |
| Cumulative snapshot count | Expiration policy working | Over 1,000 |
| metadata.json size | Metadata bloat signal | Above 8 MiB |
| Average manifest size and count | Pruning efficiency | Over 5,000 manifests |
| Average commit latency | Catalog / concurrency issue | p95 above 5 s |
| delete file / data file ratio | Need for MoR compaction | Above 5% → compact |

Periodically extract these values from the catalog and metadata and put them on a dashboard — that's standard operations practice. Iceberg's system tables (db.events.files, db.events.snapshots, db.events.manifests) can be used directly.

-- File statistics
SELECT
  partition,
  count(*)              AS file_count,
  avg(file_size_in_bytes) AS avg_size,
  sum(file_size_in_bytes) AS total_size
FROM db.events.files
GROUP BY partition
ORDER BY file_count DESC;
 
-- Snapshot accumulation
SELECT count(*) FROM db.events.snapshots;
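Once those values are extracted, checking them against the red-flag thresholds is mechanical. A minimal sketch follows; the thresholds are copied from the monitoring table in this section, while the metric names and the sample values are made up for illustration.

```python
MiB = 1024 * 1024

# Thresholds from the red-flag column of the monitoring table; metric names are hypothetical.
RED_FLAGS = {
    "files_per_partition": lambda v: v > 100,
    "avg_file_bytes": lambda v: v < 32 * MiB,
    "snapshot_count": lambda v: v > 1_000,
    "metadata_json_bytes": lambda v: v > 8 * MiB,
    "manifest_count": lambda v: v > 5_000,
    "commit_p95_s": lambda v: v > 5,
    "delete_to_data_ratio": lambda v: v > 0.05,
}


def red_flags(metrics):
    """Return the names of metrics that cross their threshold, sorted."""
    return sorted(
        name for name, is_bad in RED_FLAGS.items()
        if name in metrics and is_bad(metrics[name])
    )


sample = {
    "files_per_partition": 340,
    "avg_file_bytes": 12 * MiB,
    "snapshot_count": 420,
    "commit_p95_s": 1.8,
    "delete_to_data_ratio": 0.09,
}
flags = red_flags(sample)
```

A table flagged on file count, file size, and delete ratio at once, as in this sample, is a typical MoR table that has gone too long without compaction.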

4.7 The shape of operational automation

A mature Iceberg operations team needs this set of automated jobs:

  1. Daily compaction job: rewrite_data_files on yesterday's partitions (only partitions whose small-file count crosses the threshold).
  2. Daily expiration job: expire snapshots older than N days; always keep a fixed number.
  3. Weekly manifest rewrite job: rewrite_manifests.
  4. Monthly orphan cleanup job: remove_orphan_files (dry run → validate → execute).
  5. Table-health report job: extract the monitoring metrics above and emit dashboards and alerts.

All these jobs must be idempotent and safely retryable on failure. In large environments, it is standard practice to extract this automation into a dedicated "table management service" and operate it alongside the catalog.


5. Engine Compatibility

5.1 Core engine support (as of 2026)

EngineReadWriteDMLTime travelBranch/TagV2 (MoR)V3REST Catalog
Apache SparkIn progress
TrinoPartial
Apache Flink✓ (streaming)PartialPartialIn progress
SnowflakeLimitedIn progress✓ (Polaris)
Databricks (Unity)In progress✓ (UC OSS)
BigQueryPartialPartialPartialIn progress✓ (BigLake)
AWS AthenaPartialIn progress✓ (Glue)
ClickHouseExperimentalPartial
DuckDBExperimentalPartial
PyIcebergPartialIn progress

The table is a generalized snapshot of typical support as of May 2026; check the latest release notes for each engine and Iceberg's compatibility tables at the time of adoption.

  • Spark — The standard for backfill, large ETL, and table maintenance. The system.* procedures are the most complete.
  • Trino — Interactive analytics and the BI back end. Strong on short queries; MoR application is stable.
  • Flink — Streaming ingestion. Strong on exactly-once commits and V2 delete writes.
  • Snowflake / Databricks — Self-service and BI for in-house users. The shared-table pattern through the catalog.
  • BigQuery / Athena — Reporting and ad-hoc analysis. When you only want queries with no infrastructure to operate.
  • PyIceberg — Lightweight ETL, ML training pipelines, and validation in notebooks or locally.

5.3 What the REST Catalog means

Since the REST Catalog standard took hold in 2024–2025, "decoupling engine from catalog so they can be combined freely" became real.

Figure: Multiple engines access the same single source of truth (tables) through the REST Catalog.

The implications are decisive:

  • Break engine lock-in. Moving from one engine to another does not require data migration.
  • Concentrate permissions, audits, and policies in one place. The catalog becomes the true control plane.
  • The cost of adopting a new engine drops. Every engine implementing the REST spec can immediately work against the same tables.

6. Comparison with Other Formats

6.1 Iceberg vs Delta Lake vs Hudi — the core differences

| Item | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Starting point | Multi-engine, metadata-centric | Spark/Databricks-centric, transaction log | Streaming upsert / incremental processing |
| Metadata model | Snapshot + manifest tree | Transaction log (JSON) + checkpoints | Timeline (.hoodie) + metadata |
| Catalog abstraction | First-class (REST spec) | Secondary (Unity is filling the gap) | External dependency |
| Hidden partitioning | ✓ | Limited (generated column) | Partial |
| Partition evolution | ✓ | Limited | Partial |
| Schema evolution | Safe (ID-based) | Safe (name-based) | Safe |
| Row-level delete | V2 delete files / V3 vectors | Deletion vectors | Soft delete + compaction |
| Time travel | ✓ | ✓ | ✓ |
| Branch / tag | ✓ (Git-like) | Limited (time travel only) | Limited |
| Multi-engine maturity | Highest | Databricks-centric, improving externally | Spark/Flink-centric |
| Streaming workloads | Possible and improving | Possible | Most mature |
| Public standard spec | Open and agreed through v3 | Spec public but Databricks-led | Open |

6.2 Decision tree

Figure: Iceberg vs Delta Lake vs Hudi decision tree. Pick a format based on five questions: single engine vs not, multi-engine, CDC share, long-term retention / experiment isolation needs, and Hive-asset migration.

6.3 Delta-Iceberg interop options

Since 2024, both camps have pursued interop, producing the following options.

  • Delta UniForm — Generate Iceberg metadata alongside a Delta table so Iceberg readers can read the Delta table. One-way (Delta → Iceberg read).
  • Apache XTable (formerly OneTable) — Translates metadata between Iceberg, Delta, and Hudi. Data is shared; only metadata is expressed in each format.
  • Delta exposed via Iceberg REST — Unity Catalog OSS is moving toward exposing Delta tables through the Iceberg REST spec.

Interop options are convenient, but "native spec as-is" is always the most stable. Known limitations of interop modes (e.g., gaps in V2 delete support) must be reviewed carefully before adoption.


7. Adoption Strategy and Migration Patterns

7.1 Which workloads to adopt first

Priority recommendations:

  1. Long-term retention / legal-correction data — Time travel and branch/tag value materialize immediately.
  2. Core fact tables that need multi-engine sharing — Once standardized, the impact ripples across the entire in-house analytics infrastructure.
  3. Marts for new domains — Lowest-risk way to accumulate operations experience.
  4. Existing Hive core tables — The biggest payoff, but also the biggest migration burden. Approach only after building operations know-how on 1–3.

7.2 Hive → Iceberg migration options

Three standard patterns.

(a) migrate — in-place replacement

CALL system.migrate('hive_db.events');
-- Replace the Hive table's metadata with Iceberg metadata.
-- Data files stay put. The fastest option.
  • Pros: No data movement, completes in minutes.
  • Cons: The old Hive directory structure (partition-key encoding) remains, so you lose some of the hidden-partitioning benefit. Column IDs get assigned to old files; some engines pay extra cost on the first query.

(b) snapshot — shadow table

CALL system.snapshot('hive_db.events', 'iceberg_db.events_v2');
-- Leave the Hive table in place and create an Iceberg table that references the same data files.
-- You can write to both during a validation/comparison window.
  • Pros: Safe comparison and rollback during operations.
  • Cons: You must maintain both sets of metadata in parallel for a while.

(c) CTAS — full rewrite

CREATE TABLE iceberg_db.events
USING iceberg
PARTITIONED BY (days(event_ts))
TBLPROPERTIES ('write.distribution-mode'='hash')
AS SELECT * FROM hive_db.events;
  • Pros: Apply a fresh partition spec, sort order, compression codec, and file-size policy from the start. Cleanest state.
  • Cons: Data is rewritten. Petabyte scale costs time and money.

Recommendation: (c) for core tables intended for long-term operation, (b) when short-term comparison matters, (a) when fast adoption is the priority.

7.3 Delta → Iceberg

Options:

  1. Use UniForm — Keep Delta as is and additionally generate Iceberg metadata. Cheapest when only reads are needed.
  2. XTable for two-way metadata translation — Data is shared; metadata is exposed in both formats.
  3. CTAS, full rewrite — Recommended when you are ready to operate natively on Iceberg.

Operational recommendation: Do not move critical workloads off of Delta immediately. Build 6–12 months of Iceberg operations experience on a new domain or a shadow table, then migrate in stages.

7.4 Phase-by-phase adoption checklist

Phase 0 — Pre-assessment (2–4 weeks)

  • Inventory in-house engines, catalogs, and storage
  • Pick three candidate workloads (per the priority criteria)
  • Decide on a catalog (REST / Glue / Unity / Polaris, etc.)
  • Plan the operational-automation jobs

Phase 1 — PoC (4–8 weeks)

  • Create an Iceberg version of one candidate table via snapshot or CTAS
  • Verify identical results with two engines (e.g., Spark + Trino)
  • Run compaction, expiration, and orphan cleanup; collect monitoring metrics
  • Apply the WAP pattern and at least one time-travel use case in production

Phase 2 — Operational automation (4–8 weeks)

  • Standardize and roll out the five automation jobs from §4.7 organization-wide
  • Stabilize the catalog, permission, and audit models
  • Agree on monitoring dashboards and alert thresholds
  • Document operational runbooks (including failure scenarios)

Phase 3 — Expansion (3–6 months)

  • Migrate core tables per the priority list
  • Connect the in-house data catalog, BI, and ML pipelines to the new catalog
  • Decide whether to adopt Iceberg V3 (assess volatility and maturity)

7.5 Common mistakes in migration

  • Putting off the catalog decision — The "let's move the data first" approach turns the catalog into an operations bottleneck. Decide on it first.
  • Deferring operational automation — Something that worked nicely in PoC stops six months later under metadata explosion. Build the automation alongside the PoC.
  • Picking CoW or MoR uniformly — Ignoring per-table workload characteristics and forcing one mode makes CDC tables slow or analytics tables lose their sort. Decide per table.
  • No time-travel SLA — Without a "how far back must we be able to restore" policy, expiration becomes overly conservative and storage cost climbs forever.
  • Compaction job blowing up shuffle — One job compacting too wide a range can stall the cluster. Slice the range and time finely and cap per-job resources explicitly.

8. Outlook as of 2026

8.1 Spec evolution

  • V3 going mainstream — Variant, Geospatial, and Deletion Vectors should reach GA across major engines by late 2026. The largest gains are in CDC and real-time analytics workloads.
  • The rise of Row Lineage — Stable row-level IDs directly serve CDC, feature stores, and reproducible ML training. Combined with data-governance and lineage tooling, this will birth new operational patterns.
  • Standardization of materialized views — Spec-level agreement on MVs and aggregation caches over Iceberg will likely reshape the cost structure of analytics workloads again.

8.2 Realignment in the catalog camp

  • Apache Polaris, Unity OSS, Lakekeeper, Apache Gravitino compete on a shared REST baseline. With a standardized spec, users see the same interface regardless of backend.
  • Commercial vs OSS balance — The next three years will hinge on the choice between "use the catalog as SaaS" and "self-host." The depth of permission, audit, and metadata management will drive cost.

8.3 Changes on the engine side

  • Integration with AI/ML workloads — Iceberg's branches/tags and time travel are used to reproduce training data and to synchronize model and data versions. From 2026 on, feature stores and MLOps tooling will more frequently handle Iceberg as first-class.
  • Broader native support from OLAP engines — Direct Iceberg-write support in ClickHouse, StarRocks, and DuckDB is maturing fast.

8.4 The control point of data governance shifts

The combination of Iceberg + REST Catalog moves the governance control point from "the engine" to "the catalog." Once data masking, row-level filters, and audit logs are decided at the catalog level, the same policy applies regardless of which engine the user comes through. This makes multi-engine compliance consistent — practically for the first time.


9. Conclusions and Recommendations

9.1 Key messages

  • Iceberg is not a mere table format; it is a metadata specification that brings database semantics on top of object storage.
  • Its value is only partially visible in a single-engine environment, but becomes decisive in multi-engine, long-term, correction, and experiment-isolation workloads.
  • As of 2026, with the combination of the REST Catalog standard + V3 spec, Iceberg is effectively the standard format for multi-engine Lakehouses.
  • However, adopting Iceberg without operational automation quickly makes metadata operations cancel out the adoption benefit. Compaction, expiration, orphan cleanup, and monitoring are as important as understanding the spec.

9.2 Adoption recommendations

| Scenario | Recommendation |
|---|---|
| Building a new multi-engine data platform | Adopt Iceberg + REST Catalog as the standard from day one |
| Existing Databricks-only environment considering external engines | Delta UniForm or gradual Iceberg adoption |
| Migrating a Hive-based legacy | Phased adoption starting with priority tables; CTAS recommended |
| Real-time CDC / upsert-heavy | Compare Iceberg V2/V3 (MoR) against Hudi |
| ML / experiment reproducibility matters | Use Iceberg branches, tags, and time travel |

9.3 How Data Dynamics can help

Data Dynamics, the author of this whitepaper, supports Iceberg adoption across the following areas.

  • Lakehouse architecture design and review — Recommendations on catalog, engine, and storage topology
  • Migration execution — Staged migration from Hive and Delta, from PoC through operational automation
  • Operational-automation standardization — Idempotent job design for compaction, expiration, and cleanup; monitoring dashboards
  • Catalog operations — Selecting and operating REST Catalog backends (Polaris, Unity OSS, Lakekeeper)
  • Engine integration — Consistency validation across Spark, Trino, Flink, Snowflake, Databricks, and BigQuery

If you need a pre-adoption assessment or a technical workshop, contact us and we'll put together an adoption roadmap tailored to your environment.



This whitepaper was written based on information as of May 2026. Iceberg's spec and engine compatibility evolve rapidly; check the latest release notes and compatibility tables at the time of adoption.