How Trino + Iceberg Solves the Partition Problem
Why Hive/Impala triggers full table scans when partition columns are missing from WHERE clauses, and how Trino + Iceberg's hidden partitioning and metadata-based pruning structurally eliminates this problem.
Building Parquet-formatted Iceberg tables in Trino and querying the same data from Impala and Hive is a common architecture. But have you ever experienced a multi-TB full scan that brought your cluster to a halt, simply because someone forgot to include the partition column in a WHERE clause on Impala or Hive? This post explains why Hive/Impala structurally cannot avoid this problem, and how Trino + Iceberg fundamentally eliminates it.
1. The Problem: Missing a Partition Column Kills Your Cluster
Here is the typical pattern for creating a partitioned table in Hive/Impala:
CREATE TABLE events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP,
city STRING,
payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;Assume this table holds one year of data — approximately 500GB across 365 partitions.
-- OK: partition pruning works, scans only ~1.4GB (one day)
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
-- PROBLEM: dt partition column missing → full scan of all 365 partitions (500GB)
SELECT * FROM events WHERE user_id = 12345;The second query only has a user_id condition without dt. Hive/Impala has no way to know which partitions contain user_id = 12345, so it must open every file in every partition to find out.
2. Why Hive/Impala Cannot Avoid This
Directory-Based Partitioning
In Hive/Impala, a partition is a physical directory:
/warehouse/events/dt=2026-04-01/
/warehouse/events/dt=2026-04-02/
/warehouse/events/dt=2026-04-03/
...
/warehouse/events/dt=2026-04-12/
├── 000000_0.parquet
├── 000001_0.parquet
└── 000002_0.parquetThe Hive Metastore (HMS) maintains the list of partition directories, but knows nothing about the data inside each file (such as min/max statistics). The query planner can only prune partitions if the partition column (dt) appears explicitly in the WHERE clause.
Strict Mode Blocks but Doesn't Solve
Hive's hive.mapred.mode=strict and Impala's query options can reject queries without partition columns. But this merely prevents mistakes — it does not enable efficient querying without partition columns.
3. Partition Count Limits: Hive/Impala vs Trino+Iceberg
Finer partitioning improves query performance, but in Hive/Impala, the partition count itself becomes a bottleneck.
Hive/Impala Partition Limits
| Item | Hive | Impala |
|---|---|---|
| Partition column count | No hard limit (3–4 recommended) | Same |
| Max partition count | HMS performance degrades at tens to hundreds of thousands | Catalogd loads all partition metadata in memory; OOM beyond hundreds of thousands |
| Dynamic partition creation limit | hive.exec.max.dynamic.partitions default 1000 | Controlled by MAX_DYNAMIC_PARTITIONS query option |
| Planning overhead | RPC to HMS → query planning alone takes minutes with many partitions | Additional HDFS NameNode + Catalogd sync cost |
Combining 2–3 partition columns causes cardinality explosion. For example, dt × region × device_type = 365 × 50 × 10 = 182,500 partitions/year. This design is impractical in Hive/Impala.
Trino + Iceberg Partition Limits
| Item | Trino + Iceberg |
|---|---|
| Partition column count | No limit per Iceberg spec |
| Max partition count | Effectively unlimited. Partitions are metadata within manifest files, not HMS records |
| Dynamic partition creation limit | None |
| Planning overhead | Only reads manifest files; no HMS RPC required |
Iceberg stores partition information in manifest files, not in HMS. As partition count grows, there are no RPCs to HMS. Even with millions of partitions, manifest pruning quickly filters to the relevant files.
Comparison Summary
| Item | Hive/Impala | Trino + Iceberg |
|---|---|---|
| Partition = | HMS record + HDFS directory | Metadata within manifests |
| Practical partition limit | Tens to hundreds of thousands | Effectively unlimited |
| Partition creation cost | HMS ALTER TABLE ADD PARTITION (RPC) | Automatic on data write (manifest commit) |
| Partition count × planning time | Linear increase | Nearly constant |
| Dynamic partition limit | Default 1000/query | None |
Practical implication: In Hive/Impala, hourly partitioning is risky. In Trino + Iceberg, hours(event_time) is no problem, and composite partitions like days(event_time), bucket(32, user_id) are effortless.
4. How Iceberg Differs: Metadata-Based Architecture
Iceberg manages table metadata in a three-layer structure:
Metadata file (JSON)
└── Snapshot
└── Manifest list (Avro)
└── Manifest file (Avro)
└── Per-data-file information
├── File path
├── Partition values
├── Row count
├── Per-column min / max
├── Null count
└── File sizeThe key is the manifest file. Each data file's partition values and per-column min/max statistics are stored in the manifest, so Trino can determine whether a file is relevant to the query without opening the data file.
Fundamental difference from Hive/Impala:
| Item | Hive/Impala | Iceberg |
|---|---|---|
| Metadata location | HMS (external RPC) | Manifest files (direct storage read) |
| File-level statistics | None (partition-level only) | min/max, null count, row count, etc. |
| File pruning | Impossible | Manifest-based pruning |
5. Hidden Partitioning — Users Don't Need to Know About Partitions
This is Iceberg's most significant differentiator.
Hive Approach
-- Partition column explicitly declared in schema
CREATE TABLE events (...) PARTITIONED BY (dt STRING);
-- INSERT requires explicit partition value
INSERT INTO events PARTITION (dt = '2026-04-12') VALUES (...);
-- SELECT must include partition column for pruning
SELECT * FROM events WHERE dt = '2026-04-12';Users must know about the dt partition column and specify it in both INSERT and SELECT.
Iceberg Approach: Partition Transforms
-- No partition column exposed in schema
CREATE TABLE iceberg_catalog.db.events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR,
event_time TIMESTAMP(6) WITH TIME ZONE,
city VARCHAR,
payload VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['days(event_time)']
);Day-level partitions are automatically derived from event_time. There is no separate dt partition column in the schema.
-- Users simply filter on event_time
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-04-13 00:00:00 UTC';Trino knows about the days(event_time) partition transform and automatically scans only the relevant day's partition. Since there is no partition column in the schema, there is nothing to forget.
Supported Partition Transforms:
| Transform | Description | Example |
|---|---|---|
years(col) | Yearly | years(event_time) |
months(col) | Monthly | months(event_time) |
days(col) | Daily | days(event_time) |
hours(col) | Hourly | hours(event_time) |
bucket(N, col) | Hash into N buckets | bucket(32, user_id) |
truncate(W, col) | Truncate string/number to width W | truncate(10, city) |
6. Is a Partition Column Required in Trino + Iceberg?
Short Answer: No, But It Changes What Happens
Iceberg tables can be created and queried without any partitioning:
-- Iceberg table without partitioning
CREATE TABLE iceberg_catalog.db.events_no_partition (
event_id BIGINT,
user_id BIGINT,
event_time TIMESTAMP(6) WITH TIME ZONE,
city VARCHAR
)
WITH (format = 'PARQUET');Querying with WHERE event_time = TIMESTAMP '2026-04-12 10:00:00 UTC':
- No partition values in manifest → all manifest entries must be checked
- But Parquet footer column statistics (min/max) can skip some row groups
- Result: All files are opened, but some row groups are skipped → better than Hive full scan, but much slower than with partitioning
| Configuration | File Pruning | Row Group Pruning | Scan Scope |
|---|---|---|---|
| No partitioning | Impossible | Partial skip via Parquet min/max | Nearly everything |
| With partitioning (hidden) | Most files skipped via manifest | Additional row group skipping | Only relevant partition files |
What Happens When WHERE Uses a Non-Partition Column?
If partitioning is days(event_time) but the query is WHERE user_id = 12345:
- Manifest pruning:
user_idis unrelated to the partition transform → cannot reduce files → all data files are candidates - Parquet min/max pruning: Check each file's
user_idcolumn min/max → skip files where 12345 is out of range - Actual reads: Only open files where 12345 falls within the min/max range, then filter at row group level
Result: Zero partition pruning effect; relies entirely on Parquet statistics. If data is unsorted, this is effectively a full scan.
Mitigations:
-- Option 1: Add bucket partitioning for user_id
ALTER TABLE events SET PROPERTIES
partitioning = ARRAY['days(event_time)', 'bucket(64, user_id)'];
-- Option 2: Set sort key to maximize min/max statistics effectiveness
ALTER TABLE events SET PROPERTIES
sorted_by = ARRAY['user_id'];
ALTER TABLE events EXECUTE optimize;What Happens When WHERE References a Non-Existent Column?
SELECT * FROM events WHERE non_existent_column = 'value';
-- Result: COLUMN_NOT_FOUND error, query fails immediatelyThe column doesn't exist in the Iceberg schema, so the query is rejected at the planning stage. This behavior is the same in Hive/Impala.
However, with Schema Evolution, columns added later will return NULL for data files that predate the column addition. The query itself succeeds.
7. Trino's File Pruning Mechanisms
Trino + Iceberg filters unnecessary data through a three-stage funnel:
[All data files]
│
▼
① Manifest-based pruning
(partition values + file-level min/max)
│ Remove unnecessary files
▼
② Parquet Row Group pruning
(per-row-group column statistics)
│ Skip unnecessary row groups
▼
③ Dynamic Filtering
(propagate JOIN build side results to probe side)
│ Further skip files/row groups
▼
[Actually read data]Each stage progressively reduces the data volume, resulting in dramatically less I/O.
8. Practical Comparison: Same Data, Same Query, Different Results
Scenario Setup
- Table: One year of event logs, ~500GB
- Partitioning: Daily (Hive:
dtcolumn, Iceberg:days(event_time)) - Parquet format
Query 1: Partition Column Specified
-- Hive/Impala
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
-- Trino + Iceberg
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-04-13 00:00:00 UTC'
AND user_id = 12345;| Engine | Scan Scope | Notes |
|---|---|---|
| Hive/Impala | ~1.4GB (1 day) | Partition pruning works |
| Trino + Iceberg | ~1.4GB (1 day) | Hidden partition pruning works |
Both are fast.
Query 2: Partition Column Missing — Time Range Only
-- Hive/Impala: using event_time instead of dt
SELECT * FROM events WHERE event_time >= '2026-04-12' AND event_time < '2026-04-13';
-- Trino + Iceberg: same condition
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-04-13 00:00:00 UTC';| Engine | Scan Scope | Notes |
|---|---|---|
| Hive/Impala | ~500GB (full) | event_time is not a partition column → full scan |
| Trino + Iceberg | ~1.4GB (1 day) | days(event_time) transform enables auto-pruning |
350x difference. Full scan in Hive/Impala; one day's data in Trino + Iceberg.
Query 3: Column Unrelated to Partition
SELECT * FROM events WHERE user_id = 12345;| Engine | Scan Scope | Notes |
|---|---|---|
| Hive/Impala | ~500GB (full) | Full scan |
Trino + Iceberg (partition: days(event_time) only) | ~500GB (full) | user_id unrelated to partition; relies on Parquet min/max only |
Trino + Iceberg (partition: days(event_time) + bucket(64, user_id)) | ~7.8GB (1/64) | Bucket partition pruning |
Partition design must match query patterns to be effective.
Query 4: Non-Partitioned Iceberg Table
SELECT * FROM events_no_partition
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-04-13 00:00:00 UTC';| Engine | Scan Scope | Notes |
|---|---|---|
| Trino + Iceberg (no partition) | Nearly everything | Row group skip via Parquet min/max; no file pruning |
| Trino + Iceberg (with partition) | ~1.4GB | Most files skipped via partition pruning |
Non-partitioned Iceberg is slightly better than a Hive full scan, but there is a 10–50x performance gap compared to having partitions.
9. Iceberg Table Creation and Partition Configuration
Basic Creation
CREATE TABLE iceberg_catalog.db.events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR,
event_time TIMESTAMP(6) WITH TIME ZONE,
city VARCHAR,
payload VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['days(event_time)']
);Composite Partitioning
CREATE TABLE iceberg_catalog.db.events (
event_id BIGINT,
user_id BIGINT,
event_type VARCHAR,
event_time TIMESTAMP(6) WITH TIME ZONE,
city VARCHAR
)
WITH (
format = 'PARQUET',
partitioning = ARRAY['days(event_time)', 'bucket(32, user_id)']
);Partition Evolution — Changing Partition Strategy Without Rewriting Data
This is an Iceberg-specific capability not available in Hive/Impala.
-- Before: daily partitions
-- After: switch to hourly partitions
ALTER TABLE events SET PROPERTIES
partitioning = ARRAY['hours(event_time)'];Only newly written data receives hourly partitioning; existing data retains daily partitions. Iceberg's manifest tracks partition spec versions, so pruning works correctly even with mixed partition specs.
10. When Trino + Iceberg Performance Degrades
Trino + Iceberg is not a silver bullet. Performance degrades significantly in these situations.
10.1 Small File Problem
With frequent INSERTs, hundreds to thousands of small files accumulate per partition. Each file has a manifest entry, and each Parquet footer must be read, resulting in file count × I/O overhead.
Symptom: Small data volume but query planning takes tens of seconds.
-- Compact small files
ALTER TABLE events EXECUTE optimize;
-- Or compact specific partitions only
ALTER TABLE events EXECUTE optimize
WHERE event_time >= TIMESTAMP '2026-04-01 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-05-01 00:00:00 UTC';10.2 Manifest Bloat
When file count exceeds hundreds of thousands, reading manifests themselves becomes slow.
-- Check file count
SELECT count(*) AS file_count,
sum(file_size_in_bytes) / (1024*1024*1024) AS total_gb
FROM "events$files";10.3 WHERE Conditions That Break Predicate Pushdown
Wrapping columns in functions in the WHERE clause breaks predicate pushdown.
| Pattern | Pushdown | Notes |
|---|---|---|
WHERE event_time = TIMESTAMP '...' | Yes | Direct comparison |
WHERE CAST(event_time AS DATE) = DATE '...' | No | Function applied → full scan |
WHERE year(event_time) = 2026 | No | Function wrapper → full scan |
WHERE LOWER(city) = 'seoul' | No | String function → full scan |
WHERE city LIKE '%eoul' | No | Leading wildcard → min/max unusable |
WHERE city IN ('Seoul', 'Busan') | Yes | IN list → pushdown works |
WHERE event_time BETWEEN ... AND ... | Yes | Range → pushdown + partition pruning |
Key rule: Keep the column bare on the left side; apply transformations on the right (constant) side.
-- BAD: no pushdown
WHERE CAST(event_time AS DATE) = DATE '2026-04-12'
-- GOOD: pushdown works
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
AND event_time < TIMESTAMP '2026-04-13 00:00:00 UTC'10.4 Unsorted Data
Parquet row group min/max statistics are effective only when data is sorted.
- Sorted: Row group 1 has
user_id 1–1000, row group 2 has1001–2000→WHERE user_id = 500reads only row group 1 - Unsorted: All row groups have
user_id 1–100000→ must read everything
-- Set sort key
ALTER TABLE events SET PROPERTIES
sorted_by = ARRAY['user_id'];
-- Re-sort existing data
ALTER TABLE events EXECUTE optimize;10.5 Dynamic Filtering Not Working in JOINs
Dynamic Filtering fails when:
- Build side result is too large (hundreds of thousands of rows) → Trino abandons dynamic filter generation
LEFT JOIN/FULL OUTER JOIN→ probe side cannot be reduced- JOIN key is unrelated to partition columns
10.6 Excessive Snapshot Accumulation
Iceberg creates a new snapshot on every write. Over time, thousands of snapshots accumulate, increasing metadata file size and slowing planning.
-- Expire old snapshots
ALTER TABLE events EXECUTE expire_snapshots(retention_threshold => '7d');
-- Remove orphan files
ALTER TABLE events EXECUTE remove_orphan_files(retention_threshold => '7d');10.7 Worker Memory Exhaustion
Regardless of partition or file count, aggregating or joining large data volumes can exceed Worker memory. Trino can spill to disk, but performance drops significantly.
Mitigations:
- Adjust
query.max-memory-per-node - Avoid
SELECT *; specify only needed columns - Break large aggregations into stages
11. Performance Tips
| Item | Setting/Command | Effect |
|---|---|---|
| Collect statistics | ANALYZE events | Better execution plans from Trino planner |
| Row group size | write.parquet.row-group-size-bytes | Tune row group pruning effectiveness |
| Sort key | sorted_by = ARRAY['col'] | Narrow min/max ranges for better pruning |
| Compaction | ALTER TABLE ... EXECUTE optimize | Merge small files, apply sorting |
| Snapshot management | EXECUTE expire_snapshots | Reduce metadata size |
| Orphan file removal | EXECUTE remove_orphan_files | Reclaim unnecessary storage |
12. Performance Degradation Diagnostic Checklist
When issues arise, check in this order:
| Step | Check | How |
|---|---|---|
| 1 | Are columns wrapped in functions in WHERE? | Query review |
| 2 | Is file/partition pruning working? | EXPLAIN (TYPE DISTRIBUTED) |
| 3 | Too many small files? | SELECT count(*), sum(file_size_in_bytes) FROM "table$files" |
| 4 | Is data sorted? | Check sorted_by property |
| 5 | Too many snapshots? | SELECT count(*) FROM "table$snapshots" |
| 6 | Is Dynamic Filtering active? | Check for dynamicFilter in EXPLAIN |
| 7 | Sufficient Worker memory? | Trino Web UI → Query Detail → per-stage memory |
13. Summary
| Item | Hive/Impala | Trino + Iceberg |
|---|---|---|
| Partition structure | Directory-based, HMS-managed | Manifest-based, HMS-independent |
| Missing partition column | Full scan | Hidden partitioning enables auto-pruning |
| Partition count limit | Tens to hundreds of thousands | Effectively unlimited |
| File-level pruning | Impossible | Manifest min/max statistics |
| Partition strategy change | Requires table recreation | Partition Evolution (zero downtime) |
| Dynamic partition limit | Yes (default 1000) | None |
The partition problem in Hive/Impala is not "user error" — it is a structural limitation of directory-based partitioning. Iceberg eliminates this limitation by bringing the metadata layer into the table format itself, and Trino maximizes the use of this metadata to minimize unnecessary I/O.
That said, Trino + Iceberg is not a cure-all. Small file problems, SQL patterns that break predicate pushdown, unsorted data, and snapshot accumulation can still degrade performance. The key is to verify execution plans with EXPLAIN and make regular compaction and snapshot management a habit.
This post was written based on Trino 443 and Iceberg spec v2. If you need help with partition design or migration in your environment, feel free to reach out.
— Data Dynamics Engineering Team