Blog
trinoicebergparquetperformance

How Trino + Iceberg Solves the Partition Problem

Why Hive/Impala triggers full table scans when partition columns are missing from WHERE clauses, and how Trino + Iceberg's hidden partitioning and metadata-based pruning structurally eliminates this problem.

Data DynamicsMay 30, 202616 min read

Building Parquet-formatted Iceberg tables in Trino and querying the same data from Impala and Hive is a common architecture. But have you ever experienced a multi-TB full scan that brought your cluster to a halt, simply because someone forgot to include the partition column in a WHERE clause on Impala or Hive? This post explains why Hive/Impala structurally cannot avoid this problem, and how Trino + Iceberg fundamentally eliminates it.

1. The Problem: Missing a Partition Column Kills Your Cluster

Here is the typical pattern for creating a partitioned table in Hive/Impala:

CREATE TABLE events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  STRING,
  event_time  TIMESTAMP,
  city        STRING,
  payload     STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

Assume this table holds one year of data — approximately 500GB across 365 partitions.

-- OK: partition pruning works, scans only ~1.4GB (one day)
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
 
-- PROBLEM: dt partition column missing → full scan of all 365 partitions (500GB)
SELECT * FROM events WHERE user_id = 12345;

The second query only has a user_id condition without dt. Hive/Impala has no way to know which partitions contain user_id = 12345, so it must open every file in every partition to find out.

2. Why Hive/Impala Cannot Avoid This

Directory-Based Partitioning

In Hive/Impala, a partition is a physical directory:

/warehouse/events/dt=2026-04-01/
/warehouse/events/dt=2026-04-02/
/warehouse/events/dt=2026-04-03/
...
/warehouse/events/dt=2026-04-12/
  ├── 000000_0.parquet
  ├── 000001_0.parquet
  └── 000002_0.parquet

The Hive Metastore (HMS) maintains the list of partition directories, but knows nothing about the data inside each file (such as min/max statistics). The query planner can only prune partitions if the partition column (dt) appears explicitly in the WHERE clause.

Strict Mode Blocks but Doesn't Solve

Hive's hive.mapred.mode=strict and Impala's query options can reject queries without partition columns. But this merely prevents mistakes — it does not enable efficient querying without partition columns.

3. Partition Count Limits: Hive/Impala vs Trino+Iceberg

Finer partitioning improves query performance, but in Hive/Impala, the partition count itself becomes a bottleneck.

Hive/Impala Partition Limits

ItemHiveImpala
Partition column countNo hard limit (3–4 recommended)Same
Max partition countHMS performance degrades at tens to hundreds of thousandsCatalogd loads all partition metadata in memory; OOM beyond hundreds of thousands
Dynamic partition creation limithive.exec.max.dynamic.partitions default 1000Controlled by MAX_DYNAMIC_PARTITIONS query option
Planning overheadRPC to HMS → query planning alone takes minutes with many partitionsAdditional HDFS NameNode + Catalogd sync cost

Combining 2–3 partition columns causes cardinality explosion. For example, dt × region × device_type = 365 × 50 × 10 = 182,500 partitions/year. This design is impractical in Hive/Impala.

Trino + Iceberg Partition Limits

ItemTrino + Iceberg
Partition column countNo limit per Iceberg spec
Max partition countEffectively unlimited. Partitions are metadata within manifest files, not HMS records
Dynamic partition creation limitNone
Planning overheadOnly reads manifest files; no HMS RPC required

Iceberg stores partition information in manifest files, not in HMS. As partition count grows, there are no RPCs to HMS. Even with millions of partitions, manifest pruning quickly filters to the relevant files.

Comparison Summary

ItemHive/ImpalaTrino + Iceberg
Partition =HMS record + HDFS directoryMetadata within manifests
Practical partition limitTens to hundreds of thousandsEffectively unlimited
Partition creation costHMS ALTER TABLE ADD PARTITION (RPC)Automatic on data write (manifest commit)
Partition count × planning timeLinear increaseNearly constant
Dynamic partition limitDefault 1000/queryNone

Practical implication: In Hive/Impala, hourly partitioning is risky. In Trino + Iceberg, hours(event_time) is no problem, and composite partitions like days(event_time), bucket(32, user_id) are effortless.

4. How Iceberg Differs: Metadata-Based Architecture

Iceberg manages table metadata in a three-layer structure:

Metadata file (JSON)
  └── Snapshot
        └── Manifest list (Avro)
              └── Manifest file (Avro)
                    └── Per-data-file information
                          ├── File path
                          ├── Partition values
                          ├── Row count
                          ├── Per-column min / max
                          ├── Null count
                          └── File size

The key is the manifest file. Each data file's partition values and per-column min/max statistics are stored in the manifest, so Trino can determine whether a file is relevant to the query without opening the data file.

Fundamental difference from Hive/Impala:

ItemHive/ImpalaIceberg
Metadata locationHMS (external RPC)Manifest files (direct storage read)
File-level statisticsNone (partition-level only)min/max, null count, row count, etc.
File pruningImpossibleManifest-based pruning

5. Hidden Partitioning — Users Don't Need to Know About Partitions

This is Iceberg's most significant differentiator.

Hive Approach

-- Partition column explicitly declared in schema
CREATE TABLE events (...) PARTITIONED BY (dt STRING);
 
-- INSERT requires explicit partition value
INSERT INTO events PARTITION (dt = '2026-04-12') VALUES (...);
 
-- SELECT must include partition column for pruning
SELECT * FROM events WHERE dt = '2026-04-12';

Users must know about the dt partition column and specify it in both INSERT and SELECT.

Iceberg Approach: Partition Transforms

-- No partition column exposed in schema
CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR,
  payload     VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)']
);

Day-level partitions are automatically derived from event_time. There is no separate dt partition column in the schema.

-- Users simply filter on event_time
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';

Trino knows about the days(event_time) partition transform and automatically scans only the relevant day's partition. Since there is no partition column in the schema, there is nothing to forget.

Supported Partition Transforms:

TransformDescriptionExample
years(col)Yearlyyears(event_time)
months(col)Monthlymonths(event_time)
days(col)Dailydays(event_time)
hours(col)Hourlyhours(event_time)
bucket(N, col)Hash into N bucketsbucket(32, user_id)
truncate(W, col)Truncate string/number to width Wtruncate(10, city)

6. Is a Partition Column Required in Trino + Iceberg?

Short Answer: No, But It Changes What Happens

Iceberg tables can be created and queried without any partitioning:

-- Iceberg table without partitioning
CREATE TABLE iceberg_catalog.db.events_no_partition (
  event_id    BIGINT,
  user_id     BIGINT,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR
)
WITH (format = 'PARQUET');

Querying with WHERE event_time = TIMESTAMP '2026-04-12 10:00:00 UTC':

  1. No partition values in manifest → all manifest entries must be checked
  2. But Parquet footer column statistics (min/max) can skip some row groups
  3. Result: All files are opened, but some row groups are skipped → better than Hive full scan, but much slower than with partitioning
ConfigurationFile PruningRow Group PruningScan Scope
No partitioningImpossiblePartial skip via Parquet min/maxNearly everything
With partitioning (hidden)Most files skipped via manifestAdditional row group skippingOnly relevant partition files

What Happens When WHERE Uses a Non-Partition Column?

If partitioning is days(event_time) but the query is WHERE user_id = 12345:

  1. Manifest pruning: user_id is unrelated to the partition transform → cannot reduce files → all data files are candidates
  2. Parquet min/max pruning: Check each file's user_id column min/max → skip files where 12345 is out of range
  3. Actual reads: Only open files where 12345 falls within the min/max range, then filter at row group level

Result: Zero partition pruning effect; relies entirely on Parquet statistics. If data is unsorted, this is effectively a full scan.

Mitigations:

-- Option 1: Add bucket partitioning for user_id
ALTER TABLE events SET PROPERTIES
  partitioning = ARRAY['days(event_time)', 'bucket(64, user_id)'];
 
-- Option 2: Set sort key to maximize min/max statistics effectiveness
ALTER TABLE events SET PROPERTIES
  sorted_by = ARRAY['user_id'];
 
ALTER TABLE events EXECUTE optimize;

What Happens When WHERE References a Non-Existent Column?

SELECT * FROM events WHERE non_existent_column = 'value';
-- Result: COLUMN_NOT_FOUND error, query fails immediately

The column doesn't exist in the Iceberg schema, so the query is rejected at the planning stage. This behavior is the same in Hive/Impala.

However, with Schema Evolution, columns added later will return NULL for data files that predate the column addition. The query itself succeeds.

7. Trino's File Pruning Mechanisms

Trino + Iceberg filters unnecessary data through a three-stage funnel:

[All data files]


  ① Manifest-based pruning
  (partition values + file-level min/max)
       │  Remove unnecessary files

  ② Parquet Row Group pruning
  (per-row-group column statistics)
       │  Skip unnecessary row groups

  ③ Dynamic Filtering
  (propagate JOIN build side results to probe side)
       │  Further skip files/row groups

  [Actually read data]

Each stage progressively reduces the data volume, resulting in dramatically less I/O.

8. Practical Comparison: Same Data, Same Query, Different Results

Scenario Setup

  • Table: One year of event logs, ~500GB
  • Partitioning: Daily (Hive: dt column, Iceberg: days(event_time))
  • Parquet format

Query 1: Partition Column Specified

-- Hive/Impala
SELECT * FROM events WHERE dt = '2026-04-12' AND user_id = 12345;
 
-- Trino + Iceberg
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC'
  AND user_id = 12345;
EngineScan ScopeNotes
Hive/Impala~1.4GB (1 day)Partition pruning works
Trino + Iceberg~1.4GB (1 day)Hidden partition pruning works

Both are fast.

Query 2: Partition Column Missing — Time Range Only

-- Hive/Impala: using event_time instead of dt
SELECT * FROM events WHERE event_time >= '2026-04-12' AND event_time < '2026-04-13';
 
-- Trino + Iceberg: same condition
SELECT * FROM events
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';
EngineScan ScopeNotes
Hive/Impala~500GB (full)event_time is not a partition column → full scan
Trino + Iceberg~1.4GB (1 day)days(event_time) transform enables auto-pruning

350x difference. Full scan in Hive/Impala; one day's data in Trino + Iceberg.

Query 3: Column Unrelated to Partition

SELECT * FROM events WHERE user_id = 12345;
EngineScan ScopeNotes
Hive/Impala~500GB (full)Full scan
Trino + Iceberg (partition: days(event_time) only)~500GB (full)user_id unrelated to partition; relies on Parquet min/max only
Trino + Iceberg (partition: days(event_time) + bucket(64, user_id))~7.8GB (1/64)Bucket partition pruning

Partition design must match query patterns to be effective.

Query 4: Non-Partitioned Iceberg Table

SELECT * FROM events_no_partition
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC';
EngineScan ScopeNotes
Trino + Iceberg (no partition)Nearly everythingRow group skip via Parquet min/max; no file pruning
Trino + Iceberg (with partition)~1.4GBMost files skipped via partition pruning

Non-partitioned Iceberg is slightly better than a Hive full scan, but there is a 10–50x performance gap compared to having partitions.

9. Iceberg Table Creation and Partition Configuration

Basic Creation

CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR,
  payload     VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)']
);

Composite Partitioning

CREATE TABLE iceberg_catalog.db.events (
  event_id    BIGINT,
  user_id     BIGINT,
  event_type  VARCHAR,
  event_time  TIMESTAMP(6) WITH TIME ZONE,
  city        VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['days(event_time)', 'bucket(32, user_id)']
);

Partition Evolution — Changing Partition Strategy Without Rewriting Data

This is an Iceberg-specific capability not available in Hive/Impala.

-- Before: daily partitions
-- After: switch to hourly partitions
ALTER TABLE events SET PROPERTIES
  partitioning = ARRAY['hours(event_time)'];

Only newly written data receives hourly partitioning; existing data retains daily partitions. Iceberg's manifest tracks partition spec versions, so pruning works correctly even with mixed partition specs.

10. When Trino + Iceberg Performance Degrades

Trino + Iceberg is not a silver bullet. Performance degrades significantly in these situations.

10.1 Small File Problem

With frequent INSERTs, hundreds to thousands of small files accumulate per partition. Each file has a manifest entry, and each Parquet footer must be read, resulting in file count × I/O overhead.

Symptom: Small data volume but query planning takes tens of seconds.

-- Compact small files
ALTER TABLE events EXECUTE optimize;
 
-- Or compact specific partitions only
ALTER TABLE events EXECUTE optimize
  WHERE event_time >= TIMESTAMP '2026-04-01 00:00:00 UTC'
    AND event_time <  TIMESTAMP '2026-05-01 00:00:00 UTC';

10.2 Manifest Bloat

When file count exceeds hundreds of thousands, reading manifests themselves becomes slow.

-- Check file count
SELECT count(*) AS file_count,
       sum(file_size_in_bytes) / (1024*1024*1024) AS total_gb
FROM "events$files";

10.3 WHERE Conditions That Break Predicate Pushdown

Wrapping columns in functions in the WHERE clause breaks predicate pushdown.

PatternPushdownNotes
WHERE event_time = TIMESTAMP '...'YesDirect comparison
WHERE CAST(event_time AS DATE) = DATE '...'NoFunction applied → full scan
WHERE year(event_time) = 2026NoFunction wrapper → full scan
WHERE LOWER(city) = 'seoul'NoString function → full scan
WHERE city LIKE '%eoul'NoLeading wildcard → min/max unusable
WHERE city IN ('Seoul', 'Busan')YesIN list → pushdown works
WHERE event_time BETWEEN ... AND ...YesRange → pushdown + partition pruning

Key rule: Keep the column bare on the left side; apply transformations on the right (constant) side.

-- BAD: no pushdown
WHERE CAST(event_time AS DATE) = DATE '2026-04-12'
 
-- GOOD: pushdown works
WHERE event_time >= TIMESTAMP '2026-04-12 00:00:00 UTC'
  AND event_time <  TIMESTAMP '2026-04-13 00:00:00 UTC'

10.4 Unsorted Data

Parquet row group min/max statistics are effective only when data is sorted.

  • Sorted: Row group 1 has user_id 1–1000, row group 2 has 1001–2000WHERE user_id = 500 reads only row group 1
  • Unsorted: All row groups have user_id 1–100000 → must read everything
-- Set sort key
ALTER TABLE events SET PROPERTIES
  sorted_by = ARRAY['user_id'];
 
-- Re-sort existing data
ALTER TABLE events EXECUTE optimize;

10.5 Dynamic Filtering Not Working in JOINs

Dynamic Filtering fails when:

  • Build side result is too large (hundreds of thousands of rows) → Trino abandons dynamic filter generation
  • LEFT JOIN / FULL OUTER JOIN → probe side cannot be reduced
  • JOIN key is unrelated to partition columns

10.6 Excessive Snapshot Accumulation

Iceberg creates a new snapshot on every write. Over time, thousands of snapshots accumulate, increasing metadata file size and slowing planning.

-- Expire old snapshots
ALTER TABLE events EXECUTE expire_snapshots(retention_threshold => '7d');
 
-- Remove orphan files
ALTER TABLE events EXECUTE remove_orphan_files(retention_threshold => '7d');

10.7 Worker Memory Exhaustion

Regardless of partition or file count, aggregating or joining large data volumes can exceed Worker memory. Trino can spill to disk, but performance drops significantly.

Mitigations:

  • Adjust query.max-memory-per-node
  • Avoid SELECT *; specify only needed columns
  • Break large aggregations into stages

11. Performance Tips

ItemSetting/CommandEffect
Collect statisticsANALYZE eventsBetter execution plans from Trino planner
Row group sizewrite.parquet.row-group-size-bytesTune row group pruning effectiveness
Sort keysorted_by = ARRAY['col']Narrow min/max ranges for better pruning
CompactionALTER TABLE ... EXECUTE optimizeMerge small files, apply sorting
Snapshot managementEXECUTE expire_snapshotsReduce metadata size
Orphan file removalEXECUTE remove_orphan_filesReclaim unnecessary storage

12. Performance Degradation Diagnostic Checklist

When issues arise, check in this order:

StepCheckHow
1Are columns wrapped in functions in WHERE?Query review
2Is file/partition pruning working?EXPLAIN (TYPE DISTRIBUTED)
3Too many small files?SELECT count(*), sum(file_size_in_bytes) FROM "table$files"
4Is data sorted?Check sorted_by property
5Too many snapshots?SELECT count(*) FROM "table$snapshots"
6Is Dynamic Filtering active?Check for dynamicFilter in EXPLAIN
7Sufficient Worker memory?Trino Web UI → Query Detail → per-stage memory

13. Summary

ItemHive/ImpalaTrino + Iceberg
Partition structureDirectory-based, HMS-managedManifest-based, HMS-independent
Missing partition columnFull scanHidden partitioning enables auto-pruning
Partition count limitTens to hundreds of thousandsEffectively unlimited
File-level pruningImpossibleManifest min/max statistics
Partition strategy changeRequires table recreationPartition Evolution (zero downtime)
Dynamic partition limitYes (default 1000)None

The partition problem in Hive/Impala is not "user error" — it is a structural limitation of directory-based partitioning. Iceberg eliminates this limitation by bringing the metadata layer into the table format itself, and Trino maximizes the use of this metadata to minimize unnecessary I/O.

That said, Trino + Iceberg is not a cure-all. Small file problems, SQL patterns that break predicate pushdown, unsorted data, and snapshot accumulation can still degrade performance. The key is to verify execution plans with EXPLAIN and make regular compaction and snapshot management a habit.


This post was written based on Trino 443 and Iceberg spec v2. If you need help with partition design or migration in your environment, feel free to reach out.

— Data Dynamics Engineering Team