
Apache Iceberg Complete Guide — Everything About the Next-Gen Lakehouse Table Format

A comprehensive guide covering Apache Iceberg's origins, metadata architecture, core features, catalog options, engine-specific usage, operational automation, performance optimization, and practical adoption strategy.

Data Dynamics · May 23, 2026 · 13 min read

This guide targets everyone from data engineers new to Apache Iceberg to architects evaluating Iceberg adoption for an existing platform. It explains why Iceberg emerged, how it works internally, and how to use and operate it from a practical standpoint.

1. What is Apache Iceberg?

1.1 Origins — Limitations of the Hive Table Format

Throughout the 2010s, the de facto standard in big data was Hive Metastore + directory partitioning. This simple, intuitive structure worked well in the HDFS era but began to crack once cloud object stores (S3, ADLS, GCS) became the norm.

Structural problems with the Hive approach:

Directory = partition = query predicate: WHERE event_date='2026-05-23' had to match the directory path exactly for partition pruning to work. Expressions like WHERE ts > ... defeated it.
No partition evolution: Changing "daily partitions to hourly" effectively meant rewriting the entire table.
No atomicity: On S3, rename is implemented as copy + delete, so partial visibility occurs on mid-job failure.
List cost: S3 list calls on prefixes with hundreds of thousands of files are slow and expensive, and until S3 gained strong consistency in December 2020 their results could also be stale.
No read-write concurrency: Reading the same partition while it's being written can expose partial files.

To address these limitations, Netflix started the Iceberg project in 2017 and donated it to the Apache Software Foundation in 2018.

1.2 Iceberg's Core Value

Iceberg is an open table format that adds database-level transactional and evolution semantics on top of object storage.

Value	Description
ACID transactions	Serializable isolation based on snapshots
Hidden Partitioning	Automatic pruning without requiring users to know partition columns
Schema/Partition Evolution	Change schema and partitioning without rewrites
Time Travel	Query and roll back past snapshots
Branch / Tag	Git-like branches and tags on tables
Engine-neutral	First-class support in Spark, Trino, Flink, Snowflake, BigQuery, etc.

1.3 Position in the Lakehouse Architecture

Iceberg is not a "storage format" — it's a metadata specification that overlays transactional and evolution semantics on file formats like Parquet.

1.4 Notable Adopters

Netflix: Original author. Used on a multi-petabyte data platform.
Apple: Core format of its internal data platform.
Airbnb, LinkedIn, Stripe, Pinterest, Adobe: Adopted in multi-engine environments.
Snowflake, AWS, Google Cloud, Databricks: First-class support via external catalogs and external tables.

2. Iceberg Architecture

2.1 Three-Layer Metadata Structure

Iceberg separates metadata into three layers.

2.2 Directory Layout Example

warehouse/db/orders/
├── data/
│   ├── 00000-0-...parquet
│   └── 00001-0-...parquet
└── metadata/
    ├── v1.metadata.json
    ├── v2.metadata.json
    ├── snap-3051729675574597004-1-...avro    # manifest list
    └── 0c7a1f8c-...avro                       # manifest file

File	Role
`metadata.json`	Current state: schema, partition spec, sort order, current snapshot pointer
Manifest List (`snap-*.avro`)	List of manifests belonging to a snapshot + partition stats
Manifest File (`*.avro`)	List of data files in one group + per-column stats (min/max, null count, etc.)
Data File (`*.parquet`)	The actual data

2.3 Snapshot-Based Transaction Model

Every write creates a new snapshot. Snapshots are immutable and fully represent the table state at that point in time.

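Writes are serialized via Optimistic Concurrency Control (OCC). Each writer prepares new metadata against the table state it started from. At commit time, the catalog performs an atomic compare-and-swap on the table's current metadata pointer. If another writer committed first, the swap fails and the writer retries on top of the new state.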
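The commit protocol can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not Iceberg's implementation: `InMemoryCatalog`, `compare_and_swap`, and `commit_with_retry` are hypothetical names, and a real catalog swaps a pointer to a metadata file rather than a Python string.

```python
import threading


class InMemoryCatalog:
    """Toy catalog: maps a table name to its current metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_swap(self, table, expected, new):
        """Atomically move the pointer only if nobody else moved it first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True


def commit_with_retry(catalog, table, make_new_metadata, max_attempts=5):
    """Optimistic commit: prepare from the current state, then swap or retry."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.current(table)
        new = make_new_metadata(base)  # build new metadata on top of `base`
        if catalog.compare_and_swap(table, base, new):
            return new, attempt
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog()
catalog.compare_and_swap("db.orders", None, "v1.metadata.json")

# Simulate a concurrent writer that commits once, just before our first swap.
raced = {"done": False}


def next_version(base):
    if not raced["done"]:
        raced["done"] = True
        catalog.compare_and_swap("db.orders", "v1.metadata.json", "v2.metadata.json")
    version = int(base[1:].split(".")[0]) + 1
    return f"v{version}.metadata.json"


result, attempts = commit_with_retry(catalog, "db.orders", next_version)
print(result, attempts)
```

The first attempt loses the race and is discarded; the second attempt rebuilds on the winner's state and succeeds. No lock is held while the writer does its expensive work, which is why the approach scales to long-running jobs.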

2.4 A New Dimension of Predicate Pushdown

Query engines use manifest column statistics to prune files before opening them.

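Hive had to list the partition directories, open every file in them, and read each footer. Iceberg reads the manifest list and the matching manifests, then opens only the data files whose statistics can satisfy the predicate.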
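A minimal sketch of statistics-based pruning, assuming a manifest that records only the min and max of `ts` per data file. The `FileEntry` class, the `prune` function, and the file paths are hypothetical; real manifests carry more statistics and are stored as Avro.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """One manifest entry: a data file plus min/max stats for the `ts` column."""
    path: str
    ts_min: str
    ts_max: str


def prune(entries, lower_bound):
    """Keep only files whose [ts_min, ts_max] range can contain ts >= lower_bound."""
    return [e.path for e in entries if e.ts_max >= lower_bound]


manifest = [
    FileEntry("data/a.parquet", "2026-05-01", "2026-05-10"),
    FileEntry("data/b.parquet", "2026-05-11", "2026-05-20"),
    FileEntry("data/c.parquet", "2026-05-21", "2026-05-31"),
]

# WHERE ts >= '2026-05-23': only one file can contain matching rows.
to_open = prune(manifest, "2026-05-23")
print(to_open)
```

The decision is made from metadata alone, so the two skipped files are never opened or even listed.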

3. Iceberg Core Features

3.1 Schema Evolution

Iceberg assigns a unique ID to every column. Mapping is by ID, not name, so renames and reorders are safe.

Operation	Hive	Iceberg
Add column	OK	OK
Drop column	Risky	Safe
Rename column	Risky	Safe
Reorder column	Risky	Safe
Type promote (int → long)	Partial	Safe

ALTER TABLE orders ADD COLUMN customer_tier STRING;
ALTER TABLE orders RENAME COLUMN amt TO amount;
ALTER TABLE orders ALTER COLUMN amount TYPE DECIMAL(12,2);   -- only widening is allowed: int → long, float → double, larger decimal precision
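Why ID-based mapping makes renames safe can be shown with a small model. This is an illustrative sketch, not how any engine reads Parquet: the dictionaries and the `read_rows` helper are hypothetical, and the field IDs are made up.

```python
# Schema as stored in table metadata: field ID -> current column name.
schema_v1 = {1: "id", 2: "amt"}

# A data file written under schema_v1 stores values keyed by field ID.
old_file_rows = [{1: 10, 2: 99}, {1: 11, 2: 45}]

# RENAME COLUMN amt TO amount and ADD COLUMN customer_tier only touch metadata.
schema_v2 = {1: "id", 2: "amount", 3: "customer_tier"}


def read_rows(rows, schema):
    """Project file rows through the current schema; missing IDs read as None."""
    return [{name: row.get(field_id) for field_id, name in schema.items()} for row in rows]


projected = read_rows(old_file_rows, schema_v2)
print(projected)
```

The old file is read correctly under the new schema without being rewritten: field 2 surfaces as `amount`, and the newly added field 3 reads as null.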

3.2 Hidden Partitioning and Partition Evolution

In Hive, users had to filter on the partition column directly, as in WHERE event_date='2026-05-23'. Iceberg stores the partition transform in table metadata, so a predicate on the source column is enough for pruning.

CREATE TABLE orders (
    id BIGINT,
    ts TIMESTAMP,
    amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts));     -- automatically derives day from ts
 
-- Users only need to know ts
SELECT * FROM orders WHERE ts >= '2026-05-23';

Supported transforms: identity, bucket(N, col), truncate(N, col), year, month, day, hour.
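The day transform and the way a predicate on ts becomes a partition filter can be sketched as follows. The helper names `day_transform` and `partitions_to_scan` are hypothetical, and the sketch assumes UTC timestamps.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Day transform: whole days since 1970-01-01."""
    return (ts - EPOCH).days


def partitions_to_scan(partitions, ts_lower_bound):
    """Translate `ts >= bound` into a filter on the derived day value."""
    min_day = day_transform(ts_lower_bound)
    return sorted(p for p in partitions if p >= min_day)


rows = [
    datetime(2026, 5, 22, 23, 59, tzinfo=timezone.utc),
    datetime(2026, 5, 23, 0, 0, tzinfo=timezone.utc),
    datetime(2026, 5, 24, 12, 30, tzinfo=timezone.utc),
]
existing_partitions = {day_transform(ts) for ts in rows}

bound = datetime(2026, 5, 23, tzinfo=timezone.utc)
selected = partitions_to_scan(existing_partitions, bound)
print(selected)
```

The user never sees the derived day value. The engine computes it from the predicate on ts and skips the partition holding the 2026-05-22 row.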

Partition Evolution: Change the partitioning strategy while in operation.

ALTER TABLE orders DROP PARTITION FIELD days(ts);
ALTER TABLE orders ADD PARTITION FIELD hours(ts);

Existing files stay as-is; only new writes use the new strategy.

3.3 Time Travel, Tag, and Branch

-- Query by snapshot ID
SELECT * FROM orders VERSION AS OF 3051729675574597004;
 
-- Query by timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-05-20 00:00:00';
 
-- Create a tag (persistent)
ALTER TABLE orders CREATE TAG `release-2026-Q2`;
 
-- Create a branch (for experiments)
ALTER TABLE orders CREATE BRANCH experimental;
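Resolving TIMESTAMP AS OF comes down to finding the latest snapshot committed at or before the requested time. A minimal sketch, assuming a snapshot log ordered by commit time; the log contents and the `snapshot_as_of` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (commit timestamp, snapshot ID), ordered by commit time.
snapshot_log = [
    ("2026-05-18 09:00:00", 1001),
    ("2026-05-19 21:30:00", 1002),
    ("2026-05-21 06:15:00", 1003),
]


def snapshot_as_of(log, timestamp):
    """Return the ID of the latest snapshot committed at or before `timestamp`."""
    commit_times = [ts for ts, _ in log]
    index = bisect_right(commit_times, timestamp)
    if index == 0:
        raise ValueError("no snapshot exists at or before " + timestamp)
    return log[index - 1][1]


resolved = snapshot_as_of(snapshot_log, "2026-05-20 00:00:00")
print(resolved)
```

This also shows why snapshot expiration limits time travel: once a snapshot leaves the log, timestamps that resolved to it can no longer be queried.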

3.4 Row-Level Operations — CoW vs MoR

Mode	Behavior	Write Cost	Read Cost
Copy-on-Write (CoW)	Rewrite affected files entirely	High	Low
Merge-on-Read (MoR)	Add delete files; merge at read time	Low	High

Use MoR for frequent UPDATE/DELETE (e.g., CDC), CoW for analytics-heavy workloads.

Delete types introduced in V2 spec:

Position Delete: "Delete row N in file X"
Equality Delete: "Delete rows where column = value"

V3 introduces Deletion Vectors, significantly improving MoR efficiency.
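How a Merge-on-Read scan applies the two delete types can be modeled in a few lines. This is a conceptual sketch only: the in-memory rows, `matches`, and `merge_on_read` are hypothetical, and real delete files are scoped to specific data files and sequence numbers.

```python
# Rows of one data file, addressed by position (their index in the file).
data_file = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
    {"id": 4, "amount": 40},
]

position_deletes = {1}            # "delete the row at position 1 of this file"
equality_deletes = [{"id": 4}]    # "delete rows where id = 4"


def matches(row, predicate):
    return all(row.get(col) == value for col, value in predicate.items())


def merge_on_read(rows, pos_deletes, eq_deletes):
    """Apply both delete kinds while scanning; the data file itself is untouched."""
    live = []
    for position, row in enumerate(rows):
        if position in pos_deletes:
            continue
        if any(matches(row, predicate) for predicate in eq_deletes):
            continue
        live.append(row)
    return live


live_ids = [row["id"] for row in merge_on_read(data_file, position_deletes, equality_deletes)]
print(live_ids)
```

The write side only had to record two small delete entries, while every reader pays for the filtering. A deletion vector plays the role of `position_deletes` here, stored as a compact bitmap.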

4. Iceberg Catalog

4.1 Role of the Catalog

The catalog stores the "table name → current metadata.json location" mapping. Because it must provide atomic swap on writes, the catalog is Iceberg's equivalent of a database transaction manager.

4.2 Catalog Options

Catalog	Characteristic	Recommended For
Hive Metastore	Reuses existing HMS	On-prem with heavy Hive footprint
Hadoop (file-based)	No catalog server; commits rely on atomic file rename	Dev/test only
REST Catalog	Engine-neutral standard API	First choice for new deployments
AWS Glue	AWS-integrated	AWS-native environments
Nessie	Git-like multi-table branching	Data versioning
Snowflake / Polaris	Managed	Sharing Iceberg with Snowflake
Unity Catalog	Databricks-integrated	Sharing Iceberg with Databricks
JDBC	Uses a relational DB	Simple metastore

4.3 The Significance of the REST Catalog Standard

The REST Catalog defines the catalog as a language- and engine-neutral HTTP API. Spark, Trino, Flink, and PyIceberg all speak the same API, so an engine can switch catalog implementations by changing configuration rather than code. As of 2026, REST Catalog is the de facto standard in the Iceberg ecosystem.

5. Using Iceberg

5.1 Spark

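# Spark configuration (REST Catalog), e.g. in spark-defaults.conf or via --conf
# The extensions entry enables CALL procedures and the Iceberg-specific ALTER TABLE syntax used in this guide
spark.sql.extensions                  = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo                = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type           = rest
spark.sql.catalog.demo.uri            = http://iceberg-rest:8181
spark.sql.catalog.demo.warehouse      = s3://my-warehouse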

CREATE TABLE demo.db.orders (
    id BIGINT,
    customer_id BIGINT,
    ts TIMESTAMP,
    amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts));
 
INSERT INTO demo.db.orders VALUES (1, 100, current_timestamp(), 99.99);
 
MERGE INTO demo.db.orders t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT *;

5.2 Trino

# Trino catalog config (iceberg.properties)
connector.name      = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181

SELECT * FROM iceberg.db.orders WHERE ts >= DATE '2026-05-01';
 
-- Metadata tables
SELECT * FROM iceberg.db."orders$snapshots";
SELECT * FROM iceberg.db."orders$files";
SELECT * FROM iceberg.db."orders$partitions";

5.3 Streaming Ingest with Flink

CREATE TABLE orders_iceberg (
    id BIGINT,
    ts TIMESTAMP(3),
    amount DECIMAL(10,2)
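) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'demo',          -- required by the Iceberg Flink connector
    'catalog-type' = 'rest',
    'catalog-database' = 'db',        -- target Iceberg database
    'catalog-table' = 'orders',       -- target Iceberg table
    'uri' = 'http://iceberg-rest:8181',
    'warehouse' = 's3://my-warehouse'
);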
 
INSERT INTO orders_iceberg
SELECT * FROM kafka_orders;

5.4 PyIceberg

from pyiceberg.catalog import load_catalog
 
catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "http://iceberg-rest:8181",
        "warehouse": "s3://my-warehouse",
    },
)
 
table = catalog.load_table("db.orders")
 
# Read directly into pandas
df = table.scan(
    row_filter="ts >= '2026-05-01'",
    selected_fields=("id", "amount"),
).to_pandas()

6. Operations and Maintenance

6.1 Compaction — The Small File Problem

Streaming ingest accumulates small files that crater query performance. Regular compaction is essential.

-- Spark Action
CALL demo.system.rewrite_data_files(
    table => 'db.orders',
    options => map('target-file-size-bytes', '536870912')   -- 512MB
);
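What a compaction planner does can be illustrated with a simplified greedy bin-packing sketch. This is not the algorithm Iceberg ships: `plan_bin_pack` is a hypothetical helper using first-fit-decreasing, and the file sizes are made up.

```python
def plan_bin_pack(file_sizes, target_size):
    """Greedy first-fit-decreasing: group files so each group is at most target_size."""
    groups = []  # each group is a list of input sizes rewritten into one output file
    for size in sorted(file_sizes, reverse=True):
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            groups.append([size])
    return groups


MB = 1024 * 1024
small_files = [40 * MB] * 20 + [8 * MB] * 30   # 50 files, 1040 MB in total
plan = plan_bin_pack(small_files, target_size=512 * MB)
print(len(small_files), "files ->", len(plan), "files")
```

Fifty small files collapse into three, and each output stays at or below the 512 MB target passed to the procedure above.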

6.2 Snapshot Expiration

Snapshots accumulate, bloating metadata and preventing data files from being GC'd.

CALL demo.system.expire_snapshots(
    table => 'db.orders',
    older_than => TIMESTAMP '2026-05-01 00:00:00',
    retain_last => 10
);
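How older_than and retain_last interact can be sketched on a linear history. This is a simplified model: `select_expired` is a hypothetical helper, the history is made up, and it ignores branches and tags, which keep their own snapshots alive.

```python
def select_expired(snapshots, older_than, retain_last):
    """Return IDs to expire: older than the cutoff and not among the newest N."""
    ordered = sorted(snapshots, key=lambda s: s[1])   # oldest first
    newest = ordered[-retain_last:] if retain_last > 0 else []
    protected = {sid for sid, _ in newest}            # always kept
    return [sid for sid, ts in ordered if ts < older_than and sid not in protected]


# Hypothetical linear history: (snapshot ID, commit timestamp).
history = [
    (1, "2026-04-10 00:00:00"),
    (2, "2026-04-20 00:00:00"),
    (3, "2026-04-28 00:00:00"),
    (4, "2026-05-02 00:00:00"),
]

expired = select_expired(history, older_than="2026-05-01 00:00:00", retain_last=2)
print(expired)
```

Snapshot 3 is older than the cutoff but survives because it is one of the two most recent. Data files become eligible for deletion only once no retained snapshot references them.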

6.3 Orphan File Cleanup

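Failed writes can leave behind files that no metadata references. Keep older_than well behind your longest-running write, because files from commits still in flight are also unreferenced and would be deleted.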

CALL demo.system.remove_orphan_files(
    table => 'db.orders',
    older_than => TIMESTAMP '2026-05-01 00:00:00'
);
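Orphan detection is essentially a set difference with an age guard. A minimal sketch, assuming a storage listing and a set of paths reachable from table metadata; `find_orphans` and all paths are hypothetical.

```python
def find_orphans(listed_files, referenced_files, older_than):
    """Files present in storage, absent from metadata, and older than the cutoff."""
    return sorted(
        path
        for path, modified_at in listed_files.items()
        if path not in referenced_files and modified_at < older_than
    )


# Hypothetical storage listing: path -> last-modified timestamp.
listed = {
    "data/a.parquet": "2026-04-01 10:00:00",
    "data/b.parquet": "2026-04-02 10:00:00",   # left behind by a failed write
    "data/c.parquet": "2026-05-23 09:59:00",   # being written by an in-flight job
}
referenced = {"data/a.parquet"}

orphans = find_orphans(listed, referenced, older_than="2026-05-01 00:00:00")
print(orphans)
```

The age guard is what protects `data/c.parquet`: it is unreferenced only because its commit has not happened yet.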

6.4 Metadata Rewrite

Too many small manifests inflate query planning time.

CALL demo.system.rewrite_manifests(table => 'db.orders');

6.5 Standard Operational Automation Pattern

Task	Cadence	Tooling
`rewrite_data_files`	Hourly/daily	Spark Action, Airflow
`expire_snapshots`	Daily	Spark Action
`remove_orphan_files`	Weekly	Spark Action
`rewrite_manifests`	As needed	Spark Action
Metrics collection	Per-minute	Catalog API + Prometheus

Without operational automation, Iceberg quickly devolves into a "metadata swamp."

7. Iceberg Spec Evolution (V1 → V2 → V3)

Version	Release	Key Features
V1	2019	Snapshots, schema/partition evolution, hidden partitioning
V2	2021	Row-level Delete (position/equality), MoR
V3	2025+	Deletion Vectors, Variant type, Geospatial, default values, row lineage

V3's Deletion Vectors replace position delete files with compact bitmaps, dramatically improving MoR read performance.

Compatibility notes:

Engines that only support older versions cannot read newer-version tables.
V2 → V3 upgrade is one-way.
Verify all client (Spark, Trino, Flink) versions in advance.

8. Comparison with Delta Lake / Hudi

Aspect	Iceberg	Delta Lake	Hudi
Original Author	Netflix	Databricks	Uber
Transaction Log	Manifest tree	JSON log	Timeline
Hidden Partitioning	OK	Partial	Partial
Partition Evolution	OK	No	No
Schema Evolution	Strong	Strong	Strong
CoW / MoR	Both	CoW (DV added)	Both
Branch / Tag	OK	No	No
REST Catalog standard	OK	UC-coupled	No
Single-engine fit	Decent	Excellent (Databricks)	Decent
Multi-engine fit	Excellent	Decent (UniForm)	Decent

Decision criteria:

Databricks-only environment → Delta Lake
Frequent upsert + streaming → Hudi
Multi-engine, long retention, evolution semantics → Iceberg

With interop layers such as Delta Lake's UniForm and Apache XTable, format choice will matter less over time.

9. Performance Best Practices

9.1 File Size

Files that are too small (<100MB) inflate plan overhead; files that are too large (>2GB) reduce parallelism. Aim for 256MB–1GB.

ALTER TABLE orders SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '536870912'
);
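The effect of file size on file count is simple arithmetic, shown here for a hypothetical 1 TiB table. The `file_count` helper is illustrative only.

```python
import math

GIB = 1024 ** 3
MIB = 1024 ** 2


def file_count(table_bytes, file_bytes):
    """Number of data files needed to hold a table at a given average file size."""
    return math.ceil(table_bytes / file_bytes)


table_size = 1024 * GIB   # a hypothetical 1 TiB table
counts = {
    label: file_count(table_size, size)
    for label, size in [("8 MiB", 8 * MIB), ("512 MiB", 512 * MIB), ("4 GiB", 4 * GIB)]
}
print(counts)
```

At 8 MiB per file the table needs 131,072 manifest entries and object requests; at 512 MiB it needs 2,048. At 4 GiB only 256 files remain, which leaves fewer units of work to spread across executors.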

9.2 Partitioning Strategy

Don't partition by extremely high-cardinality columns (e.g., user_id).
Almost always partition time columns with day or hour transforms.
Use bucket(N, col) for high-cardinality columns.
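The point of bucket(N, col) is that the number of partitions stays fixed no matter how many distinct values the column holds. A minimal sketch follows; note that the hash is a stand-in, since Iceberg's spec uses 32-bit Murmur3 rather than CRC32, and the `bucket` helper here is hypothetical.

```python
import zlib
from collections import Counter


def bucket(value, num_buckets):
    """Map a value to one of num_buckets partitions via a stable hash."""
    # Stand-in hash: Iceberg's spec uses 32-bit Murmur3, not CRC32.
    hashed = zlib.crc32(str(value).encode("utf-8"))
    return (hashed & 0x7FFFFFFF) % num_buckets


user_ids = range(100_000)
counts = Counter(bucket(uid, 16) for uid in user_ids)
print(len(counts), "partitions for", len(user_ids), "distinct user_id values")
```

Partitioning by user_id directly would create one partition per user. Bucketing caps the count at N while still letting an equality predicate on user_id prune to a single bucket.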

9.3 Sort Order / Z-Order

Sorting data to match query patterns narrows each file's min/max ranges, which makes statistics-based file pruning far more effective.

ALTER TABLE orders WRITE ORDERED BY ts, customer_id;
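The effect of sort order on pruning can be demonstrated with a small simulation. The data and the helpers `write_files` and `files_to_scan` are hypothetical; the sketch uses a single integer column to keep the ranges easy to follow.

```python
def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for rows in files if min(rows) <= hi and max(rows) >= lo)


def write_files(values, rows_per_file):
    """Split values into consecutive files, as a writer would."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


# 1,000 hypothetical customer_id values arriving in a scattered order.
arrival_order = [(i * 37) % 1000 for i in range(1000)]

unsorted_files = write_files(arrival_order, rows_per_file=100)
sorted_files = write_files(sorted(arrival_order), rows_per_file=100)

# Query: WHERE customer_id BETWEEN 200 AND 250
unsorted_hits = files_to_scan(unsorted_files, 200, 250)
sorted_hits = files_to_scan(sorted_files, 200, 250)
print(unsorted_hits, sorted_hits)
```

With scattered writes every file spans nearly the whole value range, so all ten must be read. After sorting, the same query touches one file.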

9.4 Manifest Sharding

For very large tables, split manifests by partition to reduce plan cost. Use the rewrite_manifests action.

9.5 Anti-Patterns

Not running snapshot expiration → metadata explosion
Not running compaction → hundreds of thousands of small files
Committing too frequently → manifest explosion
Too many partition columns → tiny files and bloated manifests
Using Hadoop Catalog in production → commit conflicts and data loss risk

10. Practical Adoption Guide

10.1 Migrating Existing Hive Tables

Two approaches.

Approach	Behavior	Cost	Rollback
Snapshot (`snapshot` procedure)	Keep the original Hive table; create an Iceberg shadow	Low	Easy
Migrate (`migrate` procedure)	Convert the original in-place to Iceberg	Low	Hard

-- Approach 1: Validation shadow
CALL demo.system.snapshot('hive_db.orders', 'demo.db.orders');
 
-- Approach 2: Full cutover after validation
CALL demo.system.migrate('hive_db.orders');

10.2 Building a CDC Pipeline

Key points:

Flink Iceberg sink in upsert mode uses equality deletes
Minute-level commits (not too frequent)
Nightly batch compaction + snapshot expiration

10.3 Governance and Security

AWS Lake Formation: Row/column-level permissions on Glue Catalog + Iceberg
Apache Ranger: Policy-based authorization for Hive/Trino + Iceberg
Unity Catalog: Governs Iceberg external tables in Databricks

10.4 Cost Optimization

Tune compaction file size to cut list/get call counts
Differentiate snapshot retention by workload (time-series vs master)
S3 Intelligent-Tiering for aging snapshot data
Manifest caching (Spark, Trino) to reduce plan time

11. Wrap-Up

11.1 When to Choose Iceberg

If two or more of the following apply, Iceberg is a strong fit.

You use two or more query engines simultaneously (Spark + Trino, Flink + Snowflake, etc.)
You need time travel and legal corrections on long-retention data
You may change partition strategy in the future
You need branches/tags for experiments and reproducibility
You want to avoid lock-in to a specific vendor (e.g., Databricks)

11.2 The Future of the Iceberg Ecosystem

REST Catalog standardization: The catalog is becoming the true control plane of the Lakehouse.
V3 adoption: Deletion Vectors, Variant, and Geospatial expand the addressable workloads.
Interoperability: XTable and UniForm blur the lines between Iceberg ↔ Delta ↔ Hudi.
Managed catalog competition: Snowflake Polaris, Databricks Unity, AWS Glue, and Tabular's successors form a new market.

11.3 Learning Resources

Iceberg is no longer a "new option" — it is the de facto standard for multi-engine Lakehouses, sitting at the center of the shift that makes the catalog layer the control plane of the data platform. Beyond the decision to adopt, invest more thought in operational automation and catalog strategy.