trinoicebergparquetperformance

How Trino + Iceberg Solves the Partition Problem

Why Hive/Impala triggers full table scans when partition columns are missing from WHERE clauses, and how Trino + Iceberg's hidden partitioning and metadata-based pruning structurally eliminates this problem.

Data DynamicsMay 30, 202616 min read

Building Parquet-formatted Iceberg tables in Trino and querying the same data from Impala and Hive is a common architecture. But have you ever experienced a multi-TB full scan that brought your cluster to a halt, simply because someone forgot to include the partition column in a WHERE clause on Impala or Hive? This post explains why Hive/Impala structurally cannot avoid this problem, and how Trino + Iceberg fundamentally eliminates it.

1. The Problem: Missing a Partition Column Kills Your Cluster

Here is the typical pattern for creating a partitioned table in Hive/Impala:

CREATE TABLE events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  STRING,
  event_time  TIMESTAMP,
  city        STRING,
  payload     STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

Assume this table holds one year of data — approximately 500GB across 365 partitions.

-- OK: partition pruning works, scans only ~1.4GB (one day)
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
 
-- PROBLEM: dt partition column missing → full scan of all 365 partitions (500GB)
SELECT * FROM events WHERE user_id = 12345;

The second query only has a user_id condition without dt. Hive/Impala has no way to know which partitions contain user_id = 12345, so it must open every file in every partition to find out.

2. Why Hive/Impala Cannot Avoid This

Directory-Based Partitioning

In Hive/Impala, a partition is a physical directory:

/warehouse/events/dt=2026-04-01/
/warehouse/events/dt=2026-04-02/
/warehouse/events/dt=2026-04-03/
...
/warehouse/events/dt=2026-04-12/
  ├── 000000_0.parquet
  ├── 000001_0.parquet
  └── 000002_0.parquet

The Hive Metastore (HMS) maintains the list of partition directories, but knows nothing about the data inside each file (such as min/max statistics). The query planner can only prune partitions if the partition column (dt) appears explicitly in the WHERE clause.

Strict Mode Blocks but Doesn't Solve

Hive's hive.mapred.mode=strict and Impala's query options can reject queries without partition columns. But this merely prevents mistakes — it does not enable efficient querying without partition columns.

3. Partition Count Limits: Hive/Impala vs Trino+Iceberg

Finer partitioning improves query performance, but in Hive/Impala, the partition count itself becomes a bottleneck.

Hive/Impala Partition Limits

Item	Hive	Impala
Partition column count	No hard limit (3–4 recommended)	Same
Max partition count	HMS performance degrades at tens to hundreds of thousands	Catalogd loads all partition metadata in memory; OOM beyond hundreds of thousands
Dynamic partition creation limit	`hive.exec.max.dynamic.partitions` default 1000	Controlled by `MAX_DYNAMIC_PARTITIONS` query option
Planning overhead	RPC to HMS → query planning alone takes minutes with many partitions	Additional HDFS NameNode + Catalogd sync cost

Combining 2–3 partition columns causes cardinality explosion. For example, dt × region × device_type = 365 × 50 × 10 = 182,500 partitions/year. This design is impractical in Hive/Impala.

Trino + Iceberg Partition Limits

Item	Trino + Iceberg
Partition column count	No limit per Iceberg spec
Max partition count	Effectively unlimited. Partitions are metadata within manifest files, not HMS records
Dynamic partition creation limit	None
Planning overhead	Only reads manifest files; no HMS RPC required

Iceberg stores partition information in manifest files, not in HMS. As partition count grows, there are no RPCs to HMS. Even with millions of partitions, manifest pruning quickly filters to the relevant files.

Comparison Summary

Item	Hive/Impala	Trino + Iceberg
Partition =	HMS record + HDFS directory	Metadata within manifests
Practical partition limit	Tens to hundreds of thousands	Effectively unlimited
Partition creation cost	HMS ALTER TABLE ADD PARTITION (RPC)	Automatic on data write (manifest commit)
Partition count × planning time	Linear increase	Nearly constant
Dynamic partition limit	Default 1000/query	None

Practical implication: In Hive/Impala, hourly partitioning is risky. In Trino + Iceberg, hours(event_time) is no problem, and composite partitions like days(event_time), bucket(32, user_id) are effortless.

4. How Iceberg Differs: Metadata-Based Architecture

Iceberg manages table metadata in a three-layer structure:

Loading diagram…

The key is the manifest file. Each data file's partition values and per-column min/max statistics are stored in the manifest, so Trino can determine whether a file is relevant to the query without opening the data file.

Fundamental difference from Hive/Impala:

Item	Hive/Impala	Iceberg
Metadata location	HMS (external RPC)	Manifest files (direct storage read)
File-level statistics	None (partition-level only)	min/max, null count, row count, etc.
File pruning	Impossible	Manifest-based pruning

5. Hidden Partitioning — Users Don't Need to Know About Partitions

This is Iceberg's most significant differentiator.

Hive Approach

-- Partition column explicitly declared in schema
CREATE TABLE events (...) PARTITIONED BY (dt STRING);
 
-- INSERT requires explicit partition value
INSERT INTO events PARTITION (dt = '2026-04-12') VALUES (...);
 
-- SELECT must include partition column for pruning
SELECT * FROM events WHERE dt = '2026-04-12';

Users must know about the dt partition column and specify it in both INSERT and SELECT.

Iceberg Approach: Partition Transforms

-- No partition column exposed in schema
CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR,
  payload     VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)']
);

Day-level partitions are automatically derived from event_time. There is no separate dt partition column in the schema.

-- Users simply filter on event_time
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';

Trino knows about the days(event_time) partition transform and automatically scans only the relevant day's partition. Since there is no partition column in the schema, there is nothing to forget.

Supported Partition Transforms:

Transform	Description	Example
`years(col)`	Yearly	`years(event_time)`
`months(col)`	Monthly	`months(event_time)`
`days(col)`	Daily	`days(event_time)`
`hours(col)`	Hourly	`hours(event_time)`
`bucket(N, col)`	Hash into N buckets	`bucket(32, user_id)`
`truncate(W, col)`	Truncate string/number to width W	`truncate(10, city)`

6. Is a Partition Column Required in Trino + Iceberg?

Short Answer: No, But It Changes What Happens

Iceberg tables can be created and queried without any partitioning:

-- Iceberg table without partitioning
CREATE TABLE iceberg_catalog.db.events_no_partition (
  event_id    BIGINT,
  user_id     BIGINT,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR
)
WITH (format = 'PARQUET');

Querying with WHERE event_time = TIMESTAMP '2026-04-12 10:00:00 UTC':

No partition values in manifest → all manifest entries must be checked
But Parquet footer column statistics (min/max) can skip some row groups
Result: All files are opened, but some row groups are skipped → better than Hive full scan, but much slower than with partitioning

Configuration	File Pruning	Row Group Pruning	Scan Scope
No partitioning	Impossible	Partial skip via Parquet min/max	Nearly everything
With partitioning (hidden)	Most files skipped via manifest	Additional row group skipping	Only relevant partition files

What Happens When WHERE Uses a Non-Partition Column?

If partitioning is days(event_time) but the query is WHERE user_id = 12345:

Manifest pruning: user_id is unrelated to the partition transform → cannot reduce files → all data files are candidates
Parquet min/max pruning: Check each file's user_id column min/max → skip files where 12345 is out of range
Actual reads: Only open files where 12345 falls within the min/max range, then filter at row group level

Result: Zero partition pruning effect; relies entirely on Parquet statistics. If data is unsorted, this is effectively a full scan.

Mitigations:

-- Option 1: Add bucket partitioning for user_id
ALTER TABLE events SET PROPERTIES
  partitioning = ARRAY['days(event_time)', 'bucket(64, user_id)'];
 
-- Option 2: Set sort key to maximize min/max statistics effectiveness
ALTER TABLE events SET PROPERTIES
  sorted_by = ARRAY['user_id'];
 
ALTER TABLE events EXECUTE optimize;

What Happens When WHERE References a Non-Existent Column?

SELECT * FROM events WHERE non_existent_column = 'value';
-- Result: COLUMN_NOT_FOUND error, query fails immediately

The column doesn't exist in the Iceberg schema, so the query is rejected at the planning stage. This behavior is the same in Hive/Impala.

However, with Schema Evolution, columns added later will return NULL for data files that predate the column addition. The query itself succeeds.

7. Trino's File Pruning Mechanisms

Trino + Iceberg filters unnecessary data through a three-stage funnel:

Loading diagram…

Each stage progressively reduces the data volume, resulting in dramatically less I/O.

8. Practical Comparison: Same Data, Same Query, Different Results

Scenario Setup

Table: One year of event logs, ~500GB
Partitioning: Daily (Hive: dt column, Iceberg: days(event_time))
Parquet format

Query 1: Partition Column Specified

-- Hive/Impala
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
 
-- Trino + Iceberg
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC'
  AND user_id = 12345;

Engine	Scan Scope	Notes
Hive/Impala	~1.4GB (1 day)	Partition pruning works
Trino + Iceberg	~1.4GB (1 day)	Hidden partition pruning works

Both are fast.

Query 2: Partition Column Missing — Time Range Only

-- Hive/Impala: using event_time instead of dt
SELECT * FROM events WHERE event_time >= '2026-04-12' AND event_time < '2026-04-13';
 
-- Trino + Iceberg: same condition
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';

Engine	Scan Scope	Notes
Hive/Impala	~500GB (full)	`event_time` is not a partition column → full scan
Trino + Iceberg	~1.4GB (1 day)	`days(event_time)` transform enables auto-pruning

350x difference. Full scan in Hive/Impala; one day's data in Trino + Iceberg.

Query 3: Column Unrelated to Partition

SELECT * FROM events WHERE user_id = 12345;

Engine	Scan Scope	Notes
Hive/Impala	~500GB (full)	Full scan
Trino + Iceberg (partition: `days(event_time)` only)	~500GB (full)	`user_id` unrelated to partition; relies on Parquet min/max only
Trino + Iceberg (partition: `days(event_time)` + `bucket(64, user_id)`)	~7.8GB (1/64)	Bucket partition pruning

Partition design must match query patterns to be effective.

Query 4: Non-Partitioned Iceberg Table

SELECT * FROM events_no_partition
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';

Engine	Scan Scope	Notes
Trino + Iceberg (no partition)	Nearly everything	Row group skip via Parquet min/max; no file pruning
Trino + Iceberg (with partition)	~1.4GB	Most files skipped via partition pruning

Non-partitioned Iceberg is slightly better than a Hive full scan, but there is a 10–50x performance gap compared to having partitions.

9. Iceberg Table Creation and Partition Configuration

Basic Creation

CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR,
  payload     VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)']
);

Composite Partitioning

CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)', 'bucket(32, user_id)']
);

Partition Evolution — Changing Partition Strategy Without Rewriting Data

This is an Iceberg-specific capability not available in Hive/Impala.

-- Before: daily partitions
-- After: switch to hourly partitions
ALTER TABLE events SET PROPERTIES
  partitioning = ARRAY['hours(event_time)'];

Only newly written data receives hourly partitioning; existing data retains daily partitions. Iceberg's manifest tracks partition spec versions, so pruning works correctly even with mixed partition specs.

10. When Trino + Iceberg Performance Degrades

Trino + Iceberg is not a silver bullet. Performance degrades significantly in these situations.

10.1 Small File Problem

With frequent INSERTs, hundreds to thousands of small files accumulate per partition. Each file has a manifest entry, and each Parquet footer must be read, resulting in file count × I/O overhead.

Symptom: Small data volume but query planning takes tens of seconds.

-- Compact small files
ALTER TABLE events EXECUTE optimize;
 
-- Or compact specific partitions only
ALTER TABLE events EXECUTE optimize
  WHERE event_time >= TIMESTAMP '2026-04-01 00:00:00 UTC'
    AND event_time <  TIMESTAMP '2026-05-01 00:00:00 UTC';

10.2 Manifest Bloat

When file count exceeds hundreds of thousands, reading manifests themselves becomes slow.

-- Check file count
SELECT count(*) AS file_count,
       sum(file_size_in_bytes) / (1024*1024*1024) AS total_gb
FROM "events$files";

10.3 WHERE Conditions That Break Predicate Pushdown

Wrapping columns in functions in the WHERE clause breaks predicate pushdown.

Pattern	Pushdown	Notes
`WHERE event_time = TIMESTAMP '...'`	Yes	Direct comparison
`WHERE CAST(event_time AS DATE) = DATE '...'`	No	Function applied → full scan
`WHERE year(event_time) = 2026`	No	Function wrapper → full scan
`WHERE LOWER(city) = 'seoul'`	No	String function → full scan
`WHERE city LIKE '%eoul'`	No	Leading wildcard → min/max unusable
`WHERE city IN ('Seoul', 'Busan')`	Yes	IN list → pushdown works
`WHERE event_time BETWEEN ... AND ...`	Yes	Range → pushdown + partition pruning

Key rule: Keep the column bare on the left side; apply transformations on the right (constant) side.

-- BAD: no pushdown
WHERE CAST(event_time AS DATE) = DATE '2026-04-12'
 
-- GOOD: pushdown works
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC'

10.4 Unsorted Data

Parquet row group min/max statistics are effective only when data is sorted.

Sorted: Row group 1 has user_id 1–1000, row group 2 has 1001–2000 → WHERE user_id = 500 reads only row group 1
Unsorted: All row groups have user_id 1–100000 → must read everything

-- Set sort key
ALTER TABLE events SET PROPERTIES
  sorted_by = ARRAY['user_id'];
 
-- Re-sort existing data
ALTER TABLE events EXECUTE optimize;

10.5 Dynamic Filtering Not Working in JOINs

Dynamic Filtering fails when:

Build side result is too large (hundreds of thousands of rows) → Trino abandons dynamic filter generation
LEFT JOIN / FULL OUTER JOIN → probe side cannot be reduced
JOIN key is unrelated to partition columns

10.6 Excessive Snapshot Accumulation

Iceberg creates a new snapshot on every write. Over time, thousands of snapshots accumulate, increasing metadata file size and slowing planning.

-- Expire old snapshots
ALTER TABLE events EXECUTE expire_snapshots(retention_threshold => '7d');
 
-- Remove orphan files
ALTER TABLE events EXECUTE remove_orphan_files(retention_threshold => '7d');

10.7 Worker Memory Exhaustion

Regardless of partition or file count, aggregating or joining large data volumes can exceed Worker memory. Trino can spill to disk, but performance drops significantly.

Mitigations:

Adjust query.max-memory-per-node
Avoid SELECT *; specify only needed columns
Break large aggregations into stages

11. Performance Tips

Item	Setting/Command	Effect
Collect statistics	`ANALYZE events`	Better execution plans from Trino planner
Row group size	`write.parquet.row-group-size-bytes`	Tune row group pruning effectiveness
Sort key	`sorted_by = ARRAY['col']`	Narrow min/max ranges for better pruning
Compaction	`ALTER TABLE ... EXECUTE optimize`	Merge small files, apply sorting
Snapshot management	`EXECUTE expire_snapshots`	Reduce metadata size
Orphan file removal	`EXECUTE remove_orphan_files`	Reclaim unnecessary storage

12. Performance Degradation Diagnostic Checklist

When issues arise, check in this order:

Step	Check	How
1	Are columns wrapped in functions in WHERE?	Query review
2	Is file/partition pruning working?	`EXPLAIN (TYPE DISTRIBUTED)`
3	Too many small files?	`SELECT count(*), sum(file_size_in_bytes) FROM "table$files"`
4	Is data sorted?	Check `sorted_by` property
5	Too many snapshots?	`SELECT count(*) FROM "table$snapshots"`
6	Is Dynamic Filtering active?	Check for `dynamicFilter` in EXPLAIN
7	Sufficient Worker memory?	Trino Web UI → Query Detail → per-stage memory

13. Summary

Item	Hive/Impala	Trino + Iceberg
Partition structure	Directory-based, HMS-managed	Manifest-based, HMS-independent
Missing partition column	Full scan	Hidden partitioning enables auto-pruning
Partition count limit	Tens to hundreds of thousands	Effectively unlimited
File-level pruning	Impossible	Manifest min/max statistics
Partition strategy change	Requires table recreation	Partition Evolution (zero downtime)
Dynamic partition limit	Yes (default 1000)	None

The partition problem in Hive/Impala is not "user error" — it is a structural limitation of directory-based partitioning. Iceberg eliminates this limitation by bringing the metadata layer into the table format itself, and Trino maximizes the use of this metadata to minimize unnecessary I/O.

That said, Trino + Iceberg is not a cure-all. Small file problems, SQL patterns that break predicate pushdown, unsorted data, and snapshot accumulation can still degrade performance. The key is to verify execution plans with EXPLAIN and make regular compaction and snapshot management a habit.

This post was written based on Trino 443 and Iceberg spec v2. If you need help with partition design or migration in your environment, feel free to reach out.

— Data Dynamics Engineering Team