Apache Iceberg Complete Guide — Everything About the Next-Gen Lakehouse Table Format
A comprehensive guide covering Apache Iceberg's origins, metadata architecture, core features, catalog options, engine-specific usage, operational automation, performance optimization, and practical adoption strategy.
This guide targets everyone from data engineers new to Apache Iceberg to architects evaluating Iceberg adoption for an existing platform. It explains why Iceberg emerged, how it works internally, and how to use and operate it from a practical standpoint.
1. What is Apache Iceberg?
1.1 Origins — Limitations of the Hive Table Format
Throughout the 2010s, the de facto standard in big data was Hive Metastore + directory partitioning. This simple, intuitive structure worked well in the HDFS era but began to crack once cloud object stores (S3, ADLS, GCS) became the norm.
Structural problems with the Hive approach:
- Directory = partition = query predicate: WHERE event_date='2026-05-23' had to match the directory path exactly for partition pruning to work. Expressions like WHERE ts > ... defeated it.
- No partition evolution: Changing "daily partitions to hourly" effectively meant rewriting the entire table.
- No atomicity: On S3, rename is implemented as copy + delete, so partial visibility occurs on mid-job failure.
- List cost: S3 list calls on prefixes with hundreds of thousands of files are slow and expensive, and before S3 gained strong consistency in late 2020, listings could also return stale results.
- No read-write concurrency: Reading the same partition while it's being written can expose partial files.
To address these limitations, Netflix started Iceberg in 2017 and donated it to the Apache Software Foundation in 2018.
1.2 Iceberg's Core Value
Iceberg is an open table format that adds database-level transactional and evolution semantics on top of object storage.
| Value | Description |
|---|---|
| ACID transactions | Serializable isolation based on snapshots |
| Hidden Partitioning | Automatic pruning without requiring users to know partition columns |
| Schema/Partition Evolution | Change schema and partitioning without rewrites |
| Time Travel | Query and roll back past snapshots |
| Branch / Tag | Git-like branches and tags on tables |
| Engine-neutral | First-class support in Spark, Trino, Flink, Snowflake, BigQuery, etc. |
1.3 Position in the Lakehouse Architecture
┌─────────────────────────────────────────────────┐
│ Spark │ Trino │ Flink │ Snowflake │ BigQuery │ ← Multi-engine
├─────────────────────────────────────────────────┤
│ Iceberg Catalog (REST) │ ← Control plane
├─────────────────────────────────────────────────┤
│ Iceberg Metadata (Snapshot / Manifest / ...) │ ← Table format
├─────────────────────────────────────────────────┤
│ Parquet / ORC / Avro (Data Files) │ ← File format
├─────────────────────────────────────────────────┤
│ Object Storage (S3 / ADLS / GCS) │ ← Storage
└─────────────────────────────────────────────────┘
Iceberg is not a "storage format" — it's a metadata specification that overlays transactional and evolution semantics on file formats like Parquet.
1.4 Notable Adopters
- Netflix: Original author. Used on a multi-petabyte data platform.
- Apple: Core format of its internal data platform.
- Airbnb, LinkedIn, Stripe, Pinterest, Adobe: Adopted in multi-engine environments.
- Snowflake, AWS, Google Cloud, Databricks: First-class support via external catalogs and external tables.
2. Iceberg Architecture
2.1 Three-Layer Metadata Structure
Iceberg separates metadata into three layers.
Catalog (Hive / REST / Glue / Nessie ...)
│
└─▶ metadata.json (current snapshot, schema, partition spec)
│
└─▶ manifest list (snap-*.avro)
│
└─▶ manifest file (*.avro)
│
└─▶ data file (*.parquet)
2.2 Directory Layout Example
warehouse/db/orders/
├── data/
│ ├── 00000-0-...parquet
│ └── 00001-0-...parquet
└── metadata/
├── v1.metadata.json
├── v2.metadata.json
├── snap-3051729675574597004-1-...avro # manifest list
└── 0c7a1f8c-...avro # manifest file
| File | Role |
|---|---|
| metadata.json | Current state: schema, partition spec, sort order, current snapshot pointer |
| Manifest List (snap-*.avro) | List of manifests belonging to a snapshot + partition stats |
| Manifest File (*.avro) | List of data files in one group + per-column stats (min/max, null count, etc.) |
| Data File (*.parquet) | The actual data |
2.3 Snapshot-Based Transaction Model
Every write creates a new snapshot. Snapshots are immutable and fully represent the table state at that point in time.
Snapshot 0 (table created)
│
├─▶ INSERT → Snapshot 1
│
├─▶ UPDATE → Snapshot 2
│
└─▶ DELETE → Snapshot 3 (current)
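Writes are serialized via Optimistic Concurrency Control (OCC). At commit time, the catalog performs an atomic compare-and-swap on the table's current metadata pointer, which in turn identifies the current snapshot. If another writer committed first, the swap fails and the writer retries against the new table state.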
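The commit protocol above can be sketched in a few lines. This is a minimal in-memory illustration, not Iceberg's implementation: `InMemoryCatalog`, `compare_and_swap`, and `commit` are hypothetical names, and a real writer would also re-validate and rewrite its metadata before each retry.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog holding one table's current-metadata pointer."""

    def __init__(self, location: str) -> None:
        self._location = location
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._location

    def compare_and_swap(self, expected: str, new: str) -> bool:
        # Atomic: the swap succeeds only if nobody committed in between.
        with self._lock:
            if self._location != expected:
                return False
            self._location = new
            return True


def commit(catalog: InMemoryCatalog, writer: str, max_retries: int = 5) -> str:
    """Retry loop: re-read the base, derive the next version, attempt the swap."""
    for _ in range(max_retries):
        base = catalog.current()
        version = int(base.split(".")[0].lstrip("v")) + 1
        candidate = f"v{version}.metadata.json"
        if catalog.compare_and_swap(base, candidate):
            return candidate
    raise RuntimeError(f"{writer}: gave up after {max_retries} conflicts")


catalog = InMemoryCatalog("v1.metadata.json")
stale_base = catalog.current()             # writer B reads the base early
print(commit(catalog, "writer-A"))         # v2.metadata.json
# Writer B's swap against the stale base is rejected (prints False).
print(catalog.compare_and_swap(stale_base, "v2-b.metadata.json"))
print(commit(catalog, "writer-B"))         # v3.metadata.json
```

The important property is that a writer never overwrites a state it has not seen: the losing writer must rebase on the winner's commit before trying again.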
2.4 A New Dimension of Predicate Pushdown
Query engines use manifest column statistics to prune files before opening them.
Query: SELECT * FROM orders WHERE order_date = '2026-05-23'
1. Scan manifest list → narrow to 1 manifest via partition stats
2. Scan manifest → narrow to 5 data files via column min/max
3. Open only those 5 files
Hive had to list directories, open every candidate file, and read its footer. Iceberg reads the manifest list and the matching manifests, then opens only the data files that can contain matching rows.
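The two-level pruning can be sketched as follows. The `DataFile` and `Manifest` classes and the file names are illustrative assumptions; real manifests store typed lower/upper bounds per column rather than date strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_date: str  # lower bound of order_date in this file
    max_date: str  # upper bound of order_date in this file


@dataclass(frozen=True)
class Manifest:
    path: str
    min_date: str  # partition summary kept in the manifest list
    max_date: str
    files: tuple


def plan_scan(manifest_list, target: str):
    """Return data files whose stats say they may contain order_date == target."""
    selected = []
    for manifest in manifest_list:
        # Step 1: skip whole manifests using the partition summary.
        if not (manifest.min_date <= target <= manifest.max_date):
            continue
        # Step 2: skip individual files using per-column min/max.
        for f in manifest.files:
            if f.min_date <= target <= f.max_date:
                selected.append(f.path)
    return selected


manifest_list = [
    Manifest("m-april.avro", "2026-04-01", "2026-04-30",
             (DataFile("a.parquet", "2026-04-01", "2026-04-30"),)),
    Manifest("m-may.avro", "2026-05-01", "2026-05-31",
             (DataFile("b.parquet", "2026-05-01", "2026-05-15"),
              DataFile("c.parquet", "2026-05-16", "2026-05-31"))),
]
print(plan_scan(manifest_list, "2026-05-23"))  # ['c.parquet']
```

No data file is opened during planning; only metadata is consulted. ISO-8601 date strings are used here because they compare correctly as plain strings.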
3. Iceberg Core Features
3.1 Schema Evolution
Iceberg assigns a unique ID to every column. Mapping is by ID, not name, so renames and reorders are safe.
| Operation | Hive | Iceberg |
|---|---|---|
| Add column | OK | OK |
| Drop column | Risky | Safe |
| Rename column | Risky | Safe |
| Reorder column | Risky | Safe |
| Type promote (int → long) | Partial | Safe |
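Why ID-based mapping makes renames safe can be shown with a small sketch. The dictionaries below are simplified stand-ins for the field-ID mappings in file and table schemas; `read_row` is a hypothetical helper, not an Iceberg API.

```python
# Schema as written into an old data file: field ID -> column name.
file_schema = {1: "id", 2: "amt"}
file_row = {"id": 7, "amt": 1200}

# Current table schema after RENAME amt -> amount and ADD customer_tier.
table_schema = {1: "id", 2: "amount", 3: "customer_tier"}


def read_row(row: dict, written: dict, current: dict) -> dict:
    """Project a stored row onto the current schema by matching field IDs."""
    by_id = {field_id: row[name] for field_id, name in written.items()}
    # IDs missing from the file (columns added later) read as None.
    return {name: by_id.get(field_id) for field_id, name in current.items()}


print(read_row(file_row, file_schema, table_schema))
# {'id': 7, 'amount': 1200, 'customer_tier': None}
```

Because the old file is matched on ID 2 rather than on the name amt, the rename needs no data rewrite, and a later column that reuses the name amt would get a fresh ID and never pick up the old values.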
ALTER TABLE orders ADD COLUMN customer_tier STRING;
ALTER TABLE orders RENAME COLUMN amt TO amount;
ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT;
3.2 Hidden Partitioning and Partition Evolution
In Hive, users had to write WHERE event_date='2026-05-23' using the partition column directly. Iceberg stores partition transforms in metadata.
CREATE TABLE orders (
id BIGINT,
ts TIMESTAMP,
amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts)); -- automatically derives day from ts
-- Users only need to know ts
SELECT * FROM orders WHERE ts >= '2026-05-23';
Supported transforms: identity, bucket(N, col), truncate(N, col), year, month, day, hour.
Partition Evolution: Change the partitioning strategy while in operation.
ALTER TABLE orders DROP PARTITION FIELD days(ts);
ALTER TABLE orders ADD PARTITION FIELD hours(ts);
Existing files stay as-is; only new writes use the new strategy.
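How a predicate on ts becomes a partition filter can be sketched like this. The helper names are hypothetical, timestamps are assumed to be UTC, and only the day transform with a lower-bound predicate is covered.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """Day transform: whole days since 1970-01-01 (UTC)."""
    return (ts - EPOCH).days


def prune_partitions(partition_days, ts_lower_bound: datetime):
    """Keep partitions that can contain rows with ts >= ts_lower_bound."""
    cutoff = days_transform(ts_lower_bound)
    return [d for d in partition_days if d >= cutoff]


# Partition values for 2026-05-21 .. 2026-05-24, as stored in metadata.
partitions = [
    days_transform(datetime(2026, 5, d, tzinfo=timezone.utc))
    for d in (21, 22, 23, 24)
]
kept = prune_partitions(partitions, datetime(2026, 5, 23, tzinfo=timezone.utc))
print(len(kept))  # 2
```

The user writes a filter on ts only; the engine applies the same transform to the literal and compares the result against stored partition values, so no separate partition column appears in the query.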
3.3 Time Travel, Tag, and Branch
-- Query by snapshot ID
SELECT * FROM orders VERSION AS OF 3051729675574597004;
-- Query by timestamp
SELECT * FROM orders TIMESTAMP AS OF '2026-05-20 00:00:00';
-- Create a tag (persistent)
ALTER TABLE orders CREATE TAG `release-2026-Q2`;
-- Create a branch (for experiments)
ALTER TABLE orders CREATE BRANCH experimental;
3.4 Row-Level Operations — CoW vs MoR
| Mode | Behavior | Write Cost | Read Cost |
|---|---|---|---|
| Copy-on-Write (CoW) | Rewrite affected files entirely | High | Low |
| Merge-on-Read (MoR) | Add delete files; merge at read time | Low | High |
Use MoR for frequent UPDATE/DELETE (e.g., CDC), CoW for analytics-heavy workloads.
Delete types introduced in V2 spec:
- Position Delete: "Delete row N in file X"
- Equality Delete: "Delete rows where column = value"
V3 introduces Deletion Vectors, significantly improving MoR efficiency.
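The read-time merge that MoR performs can be sketched as below. This is a simplified model with hypothetical names: it ignores sequence numbers, which in the real format decide which delete files apply to which data files.

```python
# Rows of one data file, addressed by position (0-based).
data_file = "data/00000-0.parquet"
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
    {"id": 4, "amount": 40},
]

# Position deletes: (file path, row position).
position_deletes = {(data_file, 1)}
# Equality deletes: column values that mark a row as deleted.
equality_deletes = [{"id": 4}]


def merge_on_read(path, file_rows, pos_deletes, eq_deletes):
    """Apply both delete kinds while scanning; the data file is never rewritten."""
    live = []
    for pos, row in enumerate(file_rows):
        if (path, pos) in pos_deletes:
            continue
        if any(all(row[c] == v for c, v in d.items()) for d in eq_deletes):
            continue
        live.append(row)
    return live


live_rows = merge_on_read(data_file, rows, position_deletes, equality_deletes)
print([r["id"] for r in live_rows])  # [1, 3]
```

Position deletes need only a set lookup per row, while equality deletes require comparing column values, which is one reason equality-delete-heavy tables benefit from regular compaction.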
4. Iceberg Catalog
4.1 Role of the Catalog
The catalog stores the "table name → current metadata.json location" mapping. Because it must provide atomic swap on writes, the catalog is Iceberg's equivalent of a database transaction manager.
4.2 Catalog Options
| Catalog | Characteristic | Recommended For |
|---|---|---|
| Hive Metastore | Reuses existing HMS | On-prem with heavy Hive footprint |
| Hadoop (file-based) | No catalog server; file lock | Dev/test only |
| REST Catalog | Engine-neutral standard API | First choice for new deployments |
| AWS Glue | AWS-integrated | AWS-native environments |
| Nessie | Git-like multi-table branching | Data versioning |
| Snowflake / Polaris | Managed | Sharing Iceberg with Snowflake |
| Unity Catalog | Databricks-integrated | Sharing Iceberg with Databricks |
| JDBC | Uses a relational DB | Simple metastore |
4.3 The Significance of the REST Catalog Standard
The REST Catalog defines the catalog as a language- and engine-neutral HTTP API. Spark, Trino, Flink, and PyIceberg all use the same API, so catalog implementations are largely interchangeable from the engine's point of view. As of 2026, REST Catalog is the de facto standard in the Iceberg ecosystem.
5. Using Iceberg
5.1 Spark
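# Spark configuration (REST Catalog)
# The extensions line enables Iceberg SQL such as CALL procedures and partition evolution DDL
spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type = rest
spark.sql.catalog.demo.uri = http://iceberg-rest:8181
spark.sql.catalog.demo.warehouse = s3://my-warehouse
CREATE TABLE demo.db.orders (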
id BIGINT,
customer_id BIGINT,
ts TIMESTAMP,
amount DECIMAL(10,2)
) USING iceberg
PARTITIONED BY (days(ts));
INSERT INTO demo.db.orders VALUES (1, 100, current_timestamp(), 99.99);
MERGE INTO demo.db.orders t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT *;
5.2 Trino
# Trino catalog config (iceberg.properties)
connector.name = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = http://iceberg-rest:8181
SELECT * FROM iceberg.db.orders WHERE ts >= DATE '2026-05-01';
-- Metadata tables
SELECT * FROM iceberg.db."orders$snapshots";
SELECT * FROM iceberg.db."orders$files";
SELECT * FROM iceberg.db."orders$partitions";
5.3 Streaming Ingest with Flink
CREATE TABLE orders_iceberg (
id BIGINT,
ts TIMESTAMP(3),
amount DECIMAL(10,2)
) WITH (
'connector' = 'iceberg',
'catalog-name' = 'demo',
'catalog-type' = 'rest',
'uri' = 'http://iceberg-rest:8181',
'warehouse' = 's3://my-warehouse'
);
INSERT INTO orders_iceberg
SELECT * FROM kafka_orders;
5.4 PyIceberg
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
"demo",
**{
"type": "rest",
"uri": "http://iceberg-rest:8181",
"warehouse": "s3://my-warehouse",
},
)
table = catalog.load_table("db.orders")
# Read directly into pandas
df = table.scan(
row_filter="ts >= '2026-05-01T00:00:00'",
selected_fields=("id", "amount"),
).to_pandas()
6. Operations and Maintenance
6.1 Compaction — The Small File Problem
Streaming ingest accumulates small files that crater query performance. Regular compaction is essential.
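What compaction planning does can be approximated with a greedy bin-packing sketch. This is an assumption-laden illustration, not the algorithm used by rewrite_data_files, which applies its own strategy and thresholds; `plan_compaction` is a hypothetical helper.

```python
MB = 1024 * 1024
TARGET = 512 * MB  # same value as target-file-size-bytes = 536870912


def plan_compaction(file_sizes, target=TARGET):
    """Greedy first-fit grouping: each group is rewritten into one output file."""
    groups = []
    for size in sorted(file_sizes, reverse=True):
        for group in groups:
            if sum(group) + size <= target:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return groups


small_files = [8 * MB] * 100 + [64 * MB] * 4   # 104 files, 1056 MB total
groups = plan_compaction(small_files)
print(len(small_files), "->", len(groups))     # 104 -> 3
```

In this example 104 small files collapse into 3 outputs, which cuts both the number of object-store requests per scan and the number of entries the planner has to read from manifests.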
-- Spark Action
CALL demo.system.rewrite_data_files(
table => 'db.orders',
options => map('target-file-size-bytes', '536870912') -- 512MB
);
6.2 Snapshot Expiration
Snapshots accumulate, bloating metadata and preventing data files from being GC'd.
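The interplay of an age cutoff and a minimum retained count can be sketched as follows, assuming a simple linear history with no branches or tags. The function is a hypothetical model of the retention rule, not the procedure's implementation.

```python
from datetime import datetime


def expire_snapshots(snapshots, older_than: datetime, retain_last: int):
    """Split a linear history (oldest first) into kept and expired snapshot IDs.

    A snapshot expires only if it is older than the cutoff AND is not
    among the `retain_last` most recent snapshots.
    """
    first_protected = max(0, len(snapshots) - retain_last)
    protected = {s["id"] for s in snapshots[first_protected:]}
    kept, expired = [], []
    for snap in snapshots:
        if snap["ts"] < older_than and snap["id"] not in protected:
            expired.append(snap["id"])
        else:
            kept.append(snap["id"])
    return kept, expired


# Snapshots 1-5 were committed April 21-25; snapshot 6 on May 2.
history = [{"id": i, "ts": datetime(2026, 4, 20 + i)} for i in range(1, 6)]
history.append({"id": 6, "ts": datetime(2026, 5, 2)})

kept, expired = expire_snapshots(history, datetime(2026, 5, 1), retain_last=2)
print(kept, expired)  # [5, 6] [1, 2, 3, 4]
```

Snapshot 5 is older than the cutoff but survives because of the retained-count floor. Data files become eligible for deletion only once no remaining snapshot references them.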
CALL demo.system.expire_snapshots(
table => 'db.orders',
older_than => TIMESTAMP '2026-05-01 00:00:00',
retain_last => 10
);
6.3 Orphan File Cleanup
Failed writes can leave behind files not referenced by metadata.
CALL demo.system.remove_orphan_files(
table => 'db.orders',
older_than => TIMESTAMP '2026-05-01 00:00:00'
);
6.4 Metadata Rewrite
Too many manifests inflate plan time.
CALL demo.system.rewrite_manifests(table => 'db.orders');
6.5 Standard Operational Automation Pattern
| Task | Cadence | Tooling |
|---|---|---|
| rewrite_data_files | Hourly/daily | Spark Action, Airflow |
| expire_snapshots | Daily | Spark Action |
| remove_orphan_files | Weekly | Spark Action |
| rewrite_manifests | As needed | Spark Action |
| Metrics collection | Per-minute | Catalog API + Prometheus |
Without operational automation, Iceberg quickly devolves into a "metadata swamp."
7. Iceberg Spec Evolution (V1 → V2 → V3)
| Version | Release | Key Features |
|---|---|---|
| V1 | 2019 | Snapshots, schema/partition evolution, hidden partitioning |
| V2 | 2021 | Row-level Delete (position/equality), MoR |
| V3 | 2025+ | Deletion Vectors, Variant type, Geospatial, default values, row lineage |
V3's Deletion Vectors replace position delete files with compact bitmaps, dramatically improving MoR read performance.
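The idea of a per-file bitmap of deleted positions can be shown with a toy model. A plain Python integer stands in for the bitmap here; the actual V3 encoding is a compressed bitmap format that this sketch does not reproduce, and `DeletionVector` is a hypothetical class.

```python
class DeletionVector:
    """Toy bitmap of deleted row positions for ONE data file."""

    def __init__(self) -> None:
        self._bits = 0

    def delete(self, position: int) -> None:
        # Set the bit for this row position.
        self._bits |= 1 << position

    def is_deleted(self, position: int) -> bool:
        return bool((self._bits >> position) & 1)

    def cardinality(self) -> int:
        # Number of deleted rows recorded in the vector.
        return bin(self._bits).count("1")


dv = DeletionVector()
for pos in (3, 4, 5, 1000):
    dv.delete(pos)

live = [pos for pos in range(8) if not dv.is_deleted(pos)]
print(live, dv.cardinality())  # [0, 1, 2, 6, 7] 4
```

A reader checks one bit per row instead of joining against separate delete files, and repeated deletes against the same data file update a single vector rather than piling up more files.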
Compatibility notes:
- Engines that only support older versions cannot read newer-version tables.
- V2 → V3 upgrade is one-way.
- Verify all client (Spark, Trino, Flink) versions in advance.
8. Comparison with Delta Lake / Hudi
| Aspect | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Original Author | Netflix | Databricks | Uber |
| Transaction Log | Manifest tree | JSON log | Timeline |
| Hidden Partitioning | OK | Partial | Partial |
| Partition Evolution | OK | No | No |
| Schema Evolution | Strong | Strong | Strong |
| CoW / MoR | Both | CoW (DV added) | Both |
| Branch / Tag | OK | No | No |
| REST Catalog standard | OK | UC-coupled | No |
| Single-engine fit | Decent | Excellent (Databricks) | Decent |
| Multi-engine fit | Excellent | Decent (UniForm) | Decent |
Decision criteria:
- Databricks-only environment → Delta Lake
- Frequent upsert + streaming → Hudi
- Multi-engine, long retention, evolution semantics → Iceberg
With interop layers such as Delta Lake's UniForm and Apache XTable, format choice will likely matter less over time.
9. Performance Best Practices
9.1 File Size
Files that are too small (<100MB) inflate plan overhead; files that are too large (>2GB) reduce parallelism. Aim for 256MB–1GB.
ALTER TABLE orders SET TBLPROPERTIES (
'write.target-file-size-bytes' = '536870912'
);
9.2 Partitioning Strategy
- Don't partition by extremely high-cardinality columns (e.g., user_id).
- Almost always partition time columns with day or hour transforms.
- Use bucket(N, col) for high-cardinality columns.
9.3 Sort Order / Z-Order
Sorting data to match common query predicates tightens each file's min/max ranges, which makes stats-based file pruning far more effective.
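The effect of sort order on pruning can be demonstrated with a small experiment. Integers stand in for ts values, and the interleaved layout is a deliberately bad case chosen for illustration.

```python
def file_stats(files):
    """Per-file (min, max) of ts, as a manifest would record them."""
    return [(min(f), max(f)) for f in files]


def files_to_open(stats, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= ts < hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn < hi)


values = list(range(1000))  # stand-in for ts values

# Sorted writes: each file covers a contiguous range of 100 values.
sorted_files = [values[i:i + 100] for i in range(0, 1000, 100)]
# Interleaved writes: every file spans almost the whole value range.
unsorted_files = [values[i::10] for i in range(10)]

print(files_to_open(file_stats(sorted_files), 250, 260))    # 1
print(files_to_open(file_stats(unsorted_files), 250, 260))  # 10
```

Both layouts hold the same 1,000 values in 10 files, but only the sorted layout lets min/max statistics rule files out.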
ALTER TABLE orders WRITE ORDERED BY ts, customer_id;
9.4 Manifest Sharding
For very large tables, split manifests by partition to reduce plan cost. Use the rewrite_manifests action.
9.5 Anti-Patterns
- Not running snapshot expiration → metadata explosion
- Not running compaction → hundreds of thousands of small files
- Committing too frequently → manifest explosion
- Too many partition columns → tiny files per partition and bloated manifests
- Using Hadoop Catalog in production → commit conflicts and data loss risk
10. Practical Adoption Guide
10.1 Migrating Existing Hive Tables
Two approaches.
| Approach | Behavior | Cost | Rollback |
|---|---|---|---|
| Snapshot (snapshot procedure) | Keep the original Hive table; create an Iceberg shadow | Low | Easy |
| Migrate (migrate procedure) | Convert the original in-place to Iceberg | Low | Hard |
-- Approach 1: Validation shadow
CALL demo.system.snapshot('hive_db.orders', 'demo.db.orders');
-- Approach 2: Full cutover after validation
CALL demo.system.migrate('hive_db.orders');
10.2 Building a CDC Pipeline
RDB → Debezium → Kafka → Flink → Iceberg (MoR)
│
└─▶ Compaction (batch)
Key points:
- Flink Iceberg sink in upsert mode uses equality deletes
- Minute-level commits (not too frequent)
- Nightly batch compaction + snapshot expiration
10.3 Governance and Security
- AWS Lake Formation: Row/column-level permissions on Glue Catalog + Iceberg
- Apache Ranger: Policy-based authorization for Hive/Trino + Iceberg
- Unity Catalog: Governs Iceberg external tables in Databricks
10.4 Cost Optimization
- Tune compaction file size to cut list/get call counts
- Differentiate snapshot retention by workload (time-series vs master)
- S3 Intelligent-Tiering for aging snapshot data
- Manifest caching (Spark, Trino) to reduce plan time
11. Wrap-Up
11.1 When to Choose Iceberg
If two or more of the following apply, Iceberg is a strong candidate.
- You use two or more query engines simultaneously (Spark + Trino, Flink + Snowflake, etc.)
- You need time travel and legally required corrections on long-retention data
- You may change partition strategy in the future
- You need branches/tags for experiments and reproducibility
- You want to avoid lock-in to a specific vendor (e.g., Databricks)
11.2 The Future of the Iceberg Ecosystem
- REST Catalog standardization: The catalog is becoming the true control plane of the Lakehouse.
- V3 adoption: Deletion Vectors, Variant, and Geospatial expand the addressable workloads.
- Interoperability: XTable and UniForm blur the lines between Iceberg ↔ Delta ↔ Hudi.
- Managed catalog competition: Snowflake Polaris, Databricks Unity, AWS Glue, and Tabular's successors form a new market.
11.3 Learning Resources
- Apache Iceberg Official Docs
- Iceberg Spec v2 / v3
Iceberg is no longer a "new option" — it is the de facto standard for multi-engine Lakehouses, sitting at the center of the shift that makes the catalog layer the control plane of the data platform. Beyond the decision to adopt, invest more thought in operational automation and catalog strategy.