Apache Iceberg Complete Guide — Everything About the Next-Gen Lakehouse Table Format

A comprehensive guide covering Apache Iceberg's origins, metadata architecture, core features, catalog options, engine-specific usage, operational automation, performance optimization, and practical adoption strategy.

Data Dynamics · May 23, 2026 · 13 min read

This guide targets everyone from data engineers new to Apache Iceberg to architects evaluating Iceberg adoption for an existing platform. It explains why Iceberg emerged, how it works internally, and how to use and operate it from a practical standpoint.


1. What is Apache Iceberg?

1.1 Origins — Limitations of the Hive Table Format

Throughout the 2010s, the de facto standard in big data was Hive Metastore + directory partitioning. This simple, intuitive structure worked well in the HDFS era but began to crack once cloud object stores (S3, ADLS, GCS) became the norm.

Structural problems with the Hive approach:

  • Directory = partition = query predicate: WHERE event_date='2026-05-23' had to match the directory path exactly for partition pruning to work. Expressions like WHERE ts > ... defeated it.
  • No partition evolution: Changing "daily partitions to hourly" effectively meant rewriting the entire table.
  • No atomicity: On S3, rename is implemented as copy + delete, so partial visibility occurs on mid-job failure.
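  • List cost: S3 list calls on prefixes with hundreds of thousands of files are slow and costly, and before S3 gained strong read-after-write consistency in December 2020, listings could also return stale results.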
  • No read-write concurrency: Reading the same partition while it's being written can expose partial files.

To address these limitations, Netflix started Iceberg in 2017 and donated it to the Apache Software Foundation in 2018.

1.2 Iceberg's Core Value

Iceberg is an open table format that adds database-level transactional and evolution semantics on top of object storage.

Value                      | Description
---------------------------|------------------------------------------------------------------------
ACID transactions          | Serializable isolation based on snapshots
Hidden Partitioning        | Automatic pruning without requiring users to know partition columns
Schema/Partition Evolution | Change schema and partitioning without rewrites
Time Travel                | Query and roll back past snapshots
Branch / Tag               | Git-like branches and tags on tables
Engine-neutral             | First-class support in Spark, Trino, Flink, Snowflake, BigQuery, etc.

1.3 Position in the Lakehouse Architecture

┌─────────────────────────────────────────────────┐
│   Spark │ Trino │ Flink │ Snowflake │ BigQuery  │  ← Multi-engine
├─────────────────────────────────────────────────┤
│              Iceberg Catalog (REST)             │  ← Control plane
├─────────────────────────────────────────────────┤
│   Iceberg Metadata (Snapshot / Manifest / ...)  │  ← Table format
├─────────────────────────────────────────────────┤
│       Parquet / ORC / Avro (Data Files)         │  ← File format
├─────────────────────────────────────────────────┤
│         Object Storage (S3 / ADLS / GCS)        │  ← Storage
└─────────────────────────────────────────────────┘

Iceberg is not a "storage format" — it's a metadata specification that overlays transactional and evolution semantics on file formats like Parquet.

1.4 Notable Adopters

  • Netflix: Original author. Used on a multi-petabyte data platform.
  • Apple: Core format of its internal data platform.
  • Airbnb, LinkedIn, Stripe, Pinterest, Adobe: Adopted in multi-engine environments.
  • Snowflake, AWS, Google Cloud, Databricks: First-class support via external catalogs and external tables.

2. Iceberg Architecture

2.1 Three-Layer Metadata Structure

Iceberg separates metadata into three layers.

Catalog (Hive / REST / Glue / Nessie ...)
    │
    └─▶ metadata.json (current snapshot, schema, partition spec)
            │
            └─▶ manifest list (snap-*.avro)
                    │
                    └─▶ manifest file (*.avro)
                            │
                            └─▶ data file (*.parquet)

2.2 Directory Layout Example

warehouse/db/orders/
├── data/
│   ├── 00000-0-...parquet
│   └── 00001-0-...parquet
└── metadata/
    ├── v1.metadata.json
    ├── v2.metadata.json
    ├── snap-3051729675574597004-1-...avro    # manifest list
    └── 0c7a1f8c-...avro                       # manifest file

File                        | Role
----------------------------|--------------------------------------------------------------------------------
metadata.json               | Current state: schema, partition spec, sort order, current snapshot pointer
Manifest List (snap-*.avro) | List of manifests belonging to a snapshot + partition stats
Manifest File (*.avro)      | List of data files in one group + per-column stats (min/max, null count, etc.)
Data File (*.parquet)       | The actual data

2.3 Snapshot-Based Transaction Model

Every write creates a new snapshot. Snapshots are immutable and fully represent the table state at that point in time.

Snapshot 0 (table created)
    │
    ├─▶ INSERT  → Snapshot 1
    │
    ├─▶ UPDATE  → Snapshot 2
    │
    └─▶ DELETE  → Snapshot 3 (current)

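Writes are serialized via Optimistic Concurrency Control (OCC). Each writer prepares a new metadata.json, and at commit time the catalog performs an atomic compare-and-swap on the table's current metadata pointer. If another writer committed first, the swap fails and the losing writer retries against the new table state.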
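The commit loop can be sketched in a few lines of Python. This is a minimal illustration, not Iceberg's implementation: InMemoryCatalog, compare_and_swap, and commit_with_retry are hypothetical names, and a real catalog would persist the pointer in a database or service rather than a dict.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Atomically move the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit_with_retry(catalog, table, write_metadata, max_attempts=5):
    """Optimistic commit: re-read the base and retry when another writer won."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.current(table)
        # The writer builds new metadata on top of the base it just read.
        new_location = write_metadata(base, attempt)
        if catalog.compare_and_swap(table, base, new_location):
            return new_location
    raise RuntimeError(f"commit failed after {max_attempts} attempts")
```

No lock is held while the writer prepares its metadata; only the final pointer swap is atomic, which is what lets many writers work in parallel.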

2.4 A New Dimension of Predicate Pushdown

Query engines use manifest column statistics to prune files before opening them.

Query: SELECT * FROM orders WHERE order_date = '2026-05-23'

1. Scan manifest list → narrow to 1 manifest via partition stats
2. Scan manifest      → narrow to 5 data files via column min/max
3. Open only those 5 files

Hive had to list directories, open every candidate file, and read each footer. Iceberg reads the manifest list and the relevant manifests, then opens only the data files that can match.
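The planning steps above can be sketched as a two-level filter. The DataFile and Manifest classes and the plan_scan function are hypothetical simplifications; real manifests carry bounds for many columns, not a single date range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: str  # min order_date in the file (ISO dates compare correctly as strings)
    upper: str  # max order_date in the file


@dataclass(frozen=True)
class Manifest:
    path: str
    partition_lower: str
    partition_upper: str
    files: tuple


def plan_scan(manifests, value):
    """Return paths of data files whose [lower, upper] range may contain `value`."""
    selected = []
    for manifest in manifests:
        # Step 1: skip whole manifests using the partition range from the manifest list.
        if not (manifest.partition_lower <= value <= manifest.partition_upper):
            continue
        # Step 2: skip data files using per-column min/max from the manifest.
        for f in manifest.files:
            if f.lower <= value <= f.upper:
                selected.append(f.path)
    return selected
```

Neither step opens a data file; all decisions come from metadata that was written at commit time.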


3. Iceberg Core Features

3.1 Schema Evolution

Iceberg assigns a unique ID to every column. Mapping is by ID, not name, so renames and reorders are safe.

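Operation                 | Hive    | Iceberg
--------------------------|---------|--------
Add column                | OK      | OK
Drop column               | Risky   | Safe
Rename column             | Risky   | Safe
Reorder column            | Risky   | Safe
Type promote (int → long) | Partial | Safe

ALTER TABLE orders ADD COLUMN customer_tier STRING;
ALTER TABLE orders RENAME COLUMN amt TO amount;
-- Only widening promotions are allowed (int → long, float → double, wider decimal precision)
ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(12,2);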
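Why renames are safe becomes clearer with a small sketch of ID-based column resolution. The read_row function and the dict-based schemas are hypothetical; they only illustrate the idea that a reader matches columns by field ID and ignores the name stored in old files.

```python
def read_row(file_schema, file_row, table_schema):
    """Project a stored row onto the current table schema by field ID.

    file_schema:  {field_id: column name when the file was written}
    file_row:     {column name when written: value}
    table_schema: {field_id: current column name}
    """
    result = {}
    for field_id, current_name in table_schema.items():
        written_name = file_schema.get(field_id)
        # A column added after the file was written has no stored value: read as None.
        result[current_name] = file_row.get(written_name) if written_name else None
    return result
```

A file written when field 2 was called amt still resolves correctly after the rename to amount, and a dropped column is simply never requested because its ID is absent from the table schema.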

3.2 Hidden Partitioning and Partition Evolution

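In Hive, users had to filter on the partition column directly (WHERE event_date='2026-05-23') to get pruning. Iceberg stores partition transforms in table metadata and derives partition values from the source column, so a filter on ts alone is enough.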

CREATE TABLE orders (
    id BIGINT,
    ts TIMESTAMP,
    amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts));     -- automatically derives day from ts
 
-- Users only need to know ts
SELECT * FROM orders WHERE ts >= '2026-05-23';

Supported transforms: identity, bucket(N, col), truncate(N, col), year, month, day, hour.
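To make the day transform concrete, here is a rough sketch of how a predicate on ts can be turned into a partition filter. The function names are hypothetical, and the sketch assumes UTC timestamps with the partition value expressed as whole days since 1970-01-01.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Partition value for a timestamp: whole days since 1970-01-01 (UTC)."""
    return (ts - EPOCH).days


def partitions_to_scan(partition_values, ts_lower_bound):
    """Keep partitions that can contain rows with ts >= ts_lower_bound."""
    min_day = day_transform(ts_lower_bound)
    return sorted(day for day in partition_values if day >= min_day)
```

The engine applies the same transform to the predicate that the writer applied to the data, so the user never has to mention a partition column.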

Partition Evolution: Change the partitioning strategy while in operation.

ALTER TABLE orders DROP PARTITION FIELD days(ts);
ALTER TABLE orders ADD PARTITION FIELD hours(ts);

Existing files stay as-is; only new writes use the new strategy.

3.3 Time Travel, Tag, and Branch

-- Query by snapshot ID
SELECT * FROM orders VERSION AS OF 3051729675574597004;
 
-- Query by timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-05-20 00:00:00';
 
-- Create a tag (persistent)
ALTER TABLE orders CREATE TAG `release-2026-Q2`;
 
-- Create a branch (for experiments)
ALTER TABLE orders CREATE BRANCH experimental;
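Resolving TIMESTAMP AS OF boils down to picking the latest snapshot committed at or before the requested time. A minimal sketch, assuming a hypothetical snapshot log of (timestamp in ms, snapshot ID) pairs sorted by time:

```python
from bisect import bisect_right


def snapshot_as_of(snapshot_log, as_of_ms):
    """Return the ID of the latest snapshot committed at or before `as_of_ms`.

    snapshot_log: list of (timestamp_ms, snapshot_id), sorted by timestamp.
    """
    timestamps = [ts for ts, _ in snapshot_log]
    index = bisect_right(timestamps, as_of_ms)
    if index == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return snapshot_log[index - 1][1]
```

Because snapshots are immutable, the resolved ID always reads the same files, as long as the snapshot has not been expired.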

3.4 Row-Level Operations — CoW vs MoR

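Mode                | Behavior                             | Write Cost | Read Cost
--------------------|--------------------------------------|------------|----------
Copy-on-Write (CoW) | Rewrite affected files entirely      | High       | Low
Merge-on-Read (MoR) | Add delete files; merge at read time | Low        | High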

Use MoR for frequent UPDATE/DELETE (e.g., CDC), CoW for analytics-heavy workloads.

Delete types introduced in V2 spec:

  • Position Delete: "Delete row N in file X"
  • Equality Delete: "Delete rows where column = value"

V3 introduces Deletion Vectors, significantly improving MoR efficiency.
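The trade-off between the two modes shows up clearly in a toy model. The sketch below is illustrative only: both functions are hypothetical, files are plain Python lists, and position deletes are (file path, row position) pairs.

```python
def read_merge_on_read(data_files, position_deletes):
    """Yield live rows, skipping positions listed in position delete entries.

    data_files:       {file path: list of rows}
    position_deletes: iterable of (file path, row position) pairs
    """
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)
    for path, rows in data_files.items():
        skip = deleted.get(path, set())
        for pos, row in enumerate(rows):
            if pos not in skip:
                yield row


def rewrite_copy_on_write(rows, predicate):
    """CoW alternative: rewrite the whole file without the deleted rows."""
    return [row for row in rows if not predicate(row)]
```

MoR writes only the small delete list and pays the filtering cost on every read; CoW pays once at write time by producing a new file, after which reads need no merge step.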


4. Iceberg Catalog

4.1 Role of the Catalog

The catalog stores the "table name → current metadata.json location" mapping. Because it must provide atomic swap on writes, the catalog is Iceberg's equivalent of a database transaction manager.

4.2 Catalog Options

Catalog             | Characteristic                                   | Recommended For
--------------------|--------------------------------------------------|----------------------------------
Hive Metastore      | Reuses existing HMS                              | On-prem with heavy Hive footprint
Hadoop (file-based) | No catalog server; relies on atomic file rename  | Dev/test only
REST Catalog        | Engine-neutral standard API                      | First choice for new deployments
AWS Glue            | AWS-integrated                                   | AWS-native environments
Nessie              | Git-like multi-table branching                   | Data versioning
Snowflake / Polaris | Managed                                          | Sharing Iceberg with Snowflake
Unity Catalog       | Databricks-integrated                            | Sharing Iceberg with Databricks
JDBC                | Uses a relational DB                             | Simple metastore

4.3 The Significance of the REST Catalog Standard

The REST Catalog defines the catalog as a language- and engine-neutral HTTP API. Spark, Trino, Flink, and PyIceberg all speak the same API, so a catalog implementation can usually be swapped by changing client configuration rather than code. As of 2026, REST Catalog is the de facto standard in the Iceberg ecosystem.


5. Using Iceberg

5.1 Spark

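# Spark configuration (REST Catalog)
# Enables Iceberg's SQL extensions (CALL procedures, partition field and branch/tag DDL)
spark.sql.extensions                  = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type           = rest
spark.sql.catalog.demo.uri            = http://iceberg-rest:8181
spark.sql.catalog.demo.warehouse      = s3://my-warehouse

CREATE TABLE demo.db.orders (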
    id BIGINT,
    customer_id BIGINT,
    ts TIMESTAMP,
    amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts));
 
INSERT INTO demo.db.orders VALUES (1, 100, current_timestamp(), 99.99);
 
MERGE INTO demo.db.orders t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT *;

5.2 Trino

# Trino catalog config (iceberg.properties)
connector.name           = iceberg
iceberg.catalog.type     = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181

SELECT * FROM iceberg.db.orders WHERE ts >= DATE '2026-05-01';
 
-- Metadata tables
SELECT * FROM iceberg.db."orders$snapshots";
SELECT * FROM iceberg.db."orders$files";
SELECT * FROM iceberg.db."orders$partitions";

5.3 Flink

CREATE TABLE orders_iceberg (
    id BIGINT,
    ts TIMESTAMP(3),
    amount DECIMAL(10,2)
) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'demo',      -- required by the Flink Iceberg connector
    'catalog-type' = 'rest',
    'uri' = 'http://iceberg-rest:8181',
    'warehouse' = 's3://my-warehouse'
);
 
INSERT INTO orders_iceberg
SELECT * FROM kafka_orders;

5.4 PyIceberg

from pyiceberg.catalog import load_catalog
 
catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "http://iceberg-rest:8181",
        "warehouse": "s3://my-warehouse",
    },
)
 
table = catalog.load_table("db.orders")
 
# Read directly into pandas
df = table.scan(
    row_filter="ts >= '2026-05-01'",
    selected_fields=("id", "amount"),
).to_pandas()

6. Operations and Maintenance

6.1 Compaction — The Small File Problem

Streaming ingest accumulates small files that crater query performance. Regular compaction is essential.

-- Spark Action
CALL demo.system.rewrite_data_files(
    table => 'db.orders',
    options => map('target-file-size-bytes', '536870912')   -- 512MB
);
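Conceptually, compaction is a bin-packing problem: group small files so that each rewritten file lands near the target size. The sketch below uses a simple first-fit-decreasing heuristic; plan_compaction is a hypothetical helper and does not reproduce the planning logic of the actual rewrite_data_files action.

```python
def plan_compaction(file_sizes, target_size):
    """Greedy first-fit-decreasing bin packing of small files into rewrite groups.

    file_sizes: {path: size in bytes}; files already >= target_size are left alone.
    """
    groups = []  # each group: [total_bytes, [paths]]
    small = {p: s for p, s in file_sizes.items() if s < target_size}
    for path, size in sorted(small.items(), key=lambda kv: kv[1], reverse=True):
        for group in groups:
            if group[0] + size <= target_size:
                group[0] += size
                group[1].append(path)
                break
        else:
            groups.append([size, [path]])
    # Rewriting a single file into itself gains nothing, so drop one-file groups.
    return [paths for _, paths in groups if len(paths) > 1]
```

Each returned group would become one rewritten file, and the commit swaps the old files for the new one in a single snapshot.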

6.2 Snapshot Expiration

Snapshots accumulate, bloating metadata and preventing data files from being GC'd.

CALL demo.system.expire_snapshots(
    table => 'db.orders',
    older_than => TIMESTAMP '2026-05-01 00:00:00',
    retain_last => 10
);
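How older_than and retain_last interact is easy to get wrong, so here is a simplified model. It assumes a single linear history with no branches or tags, and snapshots_to_expire is a hypothetical helper rather than the procedure's real implementation.

```python
def snapshots_to_expire(snapshots, older_than_ms, retain_last):
    """Pick snapshot IDs to expire.

    snapshots: list of (timestamp_ms, snapshot_id) in commit order (oldest first),
               assumed here to form a single linear history.
    A snapshot is kept if it is among the newest `retain_last` or is not older
    than `older_than_ms`; everything else is expired.
    """
    protected = {sid for _, sid in snapshots[-retain_last:]} if retain_last > 0 else set()
    return [
        sid
        for ts, sid in snapshots
        if ts < older_than_ms and sid not in protected
    ]
```

In other words, retain_last acts as a floor: even if every snapshot is older than the cutoff, the newest N survive.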

6.3 Orphan File Cleanup

Failed writes can leave behind files not referenced by metadata.

CALL demo.system.remove_orphan_files(
    table => 'db.orders',
    older_than => TIMESTAMP '2026-05-01 00:00:00'
);
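Orphan detection is essentially a set difference between what storage contains and what metadata references, guarded by an age threshold. A minimal sketch with a hypothetical find_orphans helper:

```python
def find_orphans(listed_files, referenced_files, older_than_ms):
    """Return files present in storage but unreachable from table metadata.

    listed_files:     {path: last-modified time in ms} from a storage listing
    referenced_files: set of paths reachable from any retained snapshot
    Files newer than `older_than_ms` are skipped because an in-flight write
    may not have committed them yet.
    """
    return sorted(
        path
        for path, modified_ms in listed_files.items()
        if path not in referenced_files and modified_ms < older_than_ms
    )
```

The age guard is the important part: setting older_than too close to the present risks deleting files that a running job is about to commit.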

6.4 Metadata Rewrite

Too many manifests inflate plan time.

CALL demo.system.rewrite_manifests(table => 'db.orders');

6.5 Standard Operational Automation Pattern

Task                | Cadence      | Tooling
--------------------|--------------|--------------------------
rewrite_data_files  | Hourly/daily | Spark Action, Airflow
expire_snapshots    | Daily        | Spark Action
remove_orphan_files | Weekly       | Spark Action
rewrite_manifests   | As needed    | Spark Action
Metrics collection  | Per-minute   | Catalog API + Prometheus

Without operational automation, Iceberg quickly devolves into a "metadata swamp."


7. Iceberg Spec Evolution (V1 → V2 → V3)

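Version | Release | Key Features
--------|---------|------------------------------------------------------------------------
V1      | 2019    | Snapshots, schema/partition evolution, hidden partitioning
V2      | 2021    | Row-level Delete (position/equality), MoR
V3      | 2025+   | Deletion Vectors, Variant type, Geospatial, default values, row lineage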

V3's Deletion Vectors replace position delete files with compact bitmaps, dramatically improving MoR read performance.

Compatibility notes:

  • Engines that only support older versions cannot read newer-version tables.
  • V2 → V3 upgrade is one-way.
  • Verify all client (Spark, Trino, Flink) versions in advance.

8. Comparison with Delta Lake / Hudi

Aspect                | Iceberg       | Delta Lake             | Hudi
----------------------|---------------|------------------------|---------
Original Author       | Netflix       | Databricks             | Uber
Transaction Log       | Manifest tree | JSON log               | Timeline
Hidden Partitioning   | OK            | Partial                | Partial
Partition Evolution   | OK            | No                     | No
Schema Evolution      | Strong        | Strong                 | Strong
CoW / MoR             | Both          | CoW (DV added)         | Both
Branch / Tag          | OK            | No                     | No
REST Catalog standard | OK            | UC-coupled             | No
Single-engine fit     | Decent        | Excellent (Databricks) | Decent
Multi-engine fit      | Excellent     | Decent (UniForm)       | Decent

Decision criteria:

  • Databricks-only environment → Delta Lake
  • Frequent upsert + streaming → Hudi
  • Multi-engine, long retention, evolution semantics → Iceberg

With interop layers such as Delta Lake's UniForm and Apache XTable, format choice will matter less over time.


9. Performance Best Practices

9.1 File Size

Files that are too small (<100MB) inflate plan overhead; files that are too large (>2GB) reduce parallelism. Aim for 256MB–1GB.

ALTER TABLE orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '536870912'
);

9.2 Partitioning Strategy

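  • Don't apply identity partitioning to extremely high-cardinality columns (e.g., user_id); each distinct value becomes its own partition.
  • Almost always partition time columns with day or hour transforms.
  • If you need to partition on a high-cardinality column, use bucket(N, col) so the partition count is capped at N.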
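The effect of bucket(N, col) can be illustrated with a toy hash partitioner. Note the assumption: CRC32 is used here purely as a stand-in, whereas the Iceberg spec defines the bucket transform with a 32-bit Murmur3 hash, so bucket numbers from this sketch will not match real tables. Both function names are hypothetical.

```python
import zlib
from collections import Counter


def bucket(num_buckets, value):
    """Map a value to one of `num_buckets` partitions via a hash.

    CRC32 is a stand-in here; the Iceberg spec uses a 32-bit Murmur3 hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def partition_counts(user_ids, num_buckets):
    """Count rows per bucket to show the partition count is capped at num_buckets."""
    return Counter(bucket(num_buckets, uid) for uid in user_ids)
```

However many distinct user IDs arrive, the table never has more than N partitions for that field, and an equality filter on the column still prunes to a single bucket.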

9.3 Sort Order / Z-Order

Sorting data to match query patterns narrows the min/max range recorded for each file, so more files can be pruned from column statistics.

ALTER TABLE orders WRITE ORDERED BY ts, customer_id;
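A quick experiment shows why sort order matters for pruning. The sketch splits the same 1,000 values into files of 100 rows, once sorted and once shuffled, and counts how many files have a min/max range covering a lookup value. The helper name and the fixed seed are arbitrary choices for illustration.

```python
import random


def files_matching(values, rows_per_file, target):
    """Split values into files and count files whose [min, max] range covers target."""
    files = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return sum(1 for f in files if min(f) <= target <= max(f))


rng = random.Random(7)            # fixed seed so the run is repeatable
unsorted_values = list(range(1000))
rng.shuffle(unsorted_values)
sorted_values = sorted(unsorted_values)

sorted_hits = files_matching(sorted_values, 100, 500)
unsorted_hits = files_matching(unsorted_values, 100, 500)
```

With sorted data exactly one file can contain the value; with shuffled data the ranges overlap heavily, so the statistics rule out far fewer files.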

9.4 Manifest Sharding

For very large tables, split manifests by partition to reduce plan cost. Use the rewrite_manifests action.

9.5 Anti-Patterns

  • Not running snapshot expiration → metadata explosion
  • Not running compaction → hundreds of thousands of small files
  • Committing too frequently → manifest explosion
  • Too many partition columns → many tiny partitions, small files, and manifest bloat
  • Using Hadoop Catalog in production → commit conflicts and data loss risk

10. Practical Adoption Guide

10.1 Migrating Existing Hive Tables

Two approaches.

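Approach                      | Behavior                                               | Cost | Rollback
------------------------------|--------------------------------------------------------|------|---------
Snapshot (snapshot procedure) | Keep the original Hive table; create an Iceberg shadow | Low  | Easy
Migrate (migrate procedure)   | Convert the original in-place to Iceberg               | Low  | Hard

-- Approach 1: Validation shadow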
CALL demo.system.snapshot('hive_db.orders', 'demo.db.orders');
 
-- Approach 2: Full cutover after validation
CALL demo.system.migrate('hive_db.orders');

10.2 Building a CDC Pipeline

RDB → Debezium → Kafka → Flink → Iceberg (MoR)
                                      │
                                      └─▶ Compaction (batch)

Key points:

  • Flink Iceberg sink in upsert mode uses equality deletes
  • Minute-level commits (not too frequent)
  • Nightly batch compaction + snapshot expiration
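How an upsert resolves at read time can be sketched with sequence numbers. This is a simplified model with a hypothetical read_upsert_table function; it relies on the rule from the Iceberg spec that an equality delete applies only to data written with a strictly lower sequence number.

```python
def read_upsert_table(data_rows, equality_deletes, key="id"):
    """Apply equality deletes to data rows using sequence numbers.

    data_rows:        list of (sequence_number, row dict)
    equality_deletes: list of (sequence_number, key value)
    A delete removes rows with the same key that were written in an earlier
    commit (strictly lower sequence number), so the row written in the same
    upsert commit survives.
    """
    live = []
    for seq, row in data_rows:
        removed = any(
            del_key == row[key] and seq < del_seq
            for del_seq, del_key in equality_deletes
        )
        if not removed:
            live.append(row)
    return live
```

Every upserted key adds a delete entry that readers must check, which is why the nightly compaction in the list above matters: it folds the deletes back into the data files.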

10.3 Governance and Security

  • AWS Lake Formation: Row/column-level permissions on Glue Catalog + Iceberg
  • Apache Ranger: Policy-based authorization for Hive/Trino + Iceberg
  • Unity Catalog: Governs Iceberg external tables in Databricks

10.4 Cost Optimization

  • Tune compaction file size to cut list/get call counts
  • Differentiate snapshot retention by workload (time-series vs master)
  • S3 Intelligent-Tiering for aging snapshot data
  • Manifest caching (Spark, Trino) to reduce plan time

11. Wrap-Up

11.1 When to Choose Iceberg

If two or more of the following apply, Iceberg is a strong candidate.

  • You use two or more query engines simultaneously (Spark + Trino, Flink + Snowflake, etc.)
  • You need time travel and legally mandated corrections on long-retention data
  • You may change partition strategy in the future
  • You need branches/tags for experiments and reproducibility
  • You want to avoid lock-in to a specific vendor (e.g., Databricks)

11.2 The Future of the Iceberg Ecosystem

  • REST Catalog standardization: The catalog is becoming the true control plane of the Lakehouse.
  • V3 adoption: Deletion Vectors, Variant, and Geospatial expand the addressable workloads.
  • Interoperability: XTable and UniForm blur the lines between Iceberg ↔ Delta ↔ Hudi.
  • Managed catalog competition: Snowflake Polaris, Databricks Unity, AWS Glue, and Tabular's successors form a new market.

11.3 Learning Resources

Iceberg is no longer a "new option" — it is the de facto standard for multi-engine Lakehouses, sitting at the center of the shift that makes the catalog layer the control plane of the data platform. Beyond the decision to adopt, invest more thought in operational automation and catalog strategy.