Why MinIO Beats HDFS — It's the Access Path, Not the Media
Same flash, same network, same Iceberg — so why is Trino+MinIO faster than Impala+HDFS? It isn't 'faster disks.' We break down the storage access-path architecture into seven concrete causes.
Lately we keep hearing the same report from customers: "Trino + MinIO (an S3-compatible object store) is faster than HDFS + Impala" — even when both sides use Iceberg as the table format.
It runs against intuition. Throughout the big-data era, HDFS justified its performance with one weapon: data locality — compute runs next to the data. So why would object storage now be faster?
This post deliberately removes every common explanation ("faster disks," "thanks to locality") and isolates the one thing that remains: the architecture of the path you take to reach storage.
First, control the variables
In engineering, "A is faster than B" is meaningless unless you state what was held equal and what was changed. Let's first pin down the confounders that usually sneak into this comparison.
| Item | Assumption in this analysis |
|---|---|
| Storage media | Both sides on flash (NVMe/SSD). "Faster disks" is removed. |
| Data access | Both accessed over the network. HDFS is deployed disaggregated too — no direct local-disk reads. "Thanks to locality" is removed. |
| Table format | Both on Iceberg. Commit semantics, rename cost, and other format variables removed. |
| What's left | The storage access path (HDFS block protocol vs S3 object protocol) and the query engine (Impala vs Trino). |
The point: hold hardware and locality equal, and "faster media" disappears as an answer. What remains is the path a single read has to travel.
Anatomy of a single read — 2-hop vs 1-hop
Put the act of opening the same Parquet file side by side on both paths, and the difference jumps out.
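On HDFS a read takes two hops: the client first asks the NameNode for the file's block locations, then fetches the bytes from a DataNode. That metadata RPC happens on every file open, so it is multiplied by the number of files in the scan. On S3/MinIO the client sends a single byte-range request; the store resolves the object's location internally from its distributed metadata, so one hop finishes the job.
Even with identical flash and network, HDFS pays at least one extra round-trip per file open. Columnar analytics makes this worse because each file is read as a series of small requests (read the file footer, then seek to the needed column chunks) rather than one long stream, so fixed per-file costs are a large share of each read. A query that touches thousands of files pays the extra hop thousands of times.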
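To put a rough number on the extra hop, here is a minimal back-of-the-envelope sketch. The function name, the round-trip time, and the idea of counting "waves" of concurrent opens are all assumptions made for illustration, not measurements from any cluster.

```python
def extra_latency_ms(files_opened: int, namenode_rtt_ms: float, parallelism: int = 1) -> float:
    """Estimate wall-clock time added by one extra metadata round-trip per file open.

    Assumption: opens are issued in waves of `parallelism` concurrent requests,
    and every request in a wave costs the same round-trip time.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    waves = -(-files_opened // parallelism)  # ceiling division
    return waves * namenode_rtt_ms


# Hypothetical inputs: 1,000 file opens at 0.5 ms per NameNode round-trip.
serial = extra_latency_ms(1000, 0.5)            # no overlap between opens
overlapped = extra_latency_ms(1000, 0.5, 100)   # 100 opens in flight at once
```

The model ignores queueing at the NameNode, which is exactly what makes the real cost grow under concurrency, so treat the result as a lower bound.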
Seven reasons it's faster on identical hardware
1. NameNode = centralized metadata + a global lock
HDFS metadata lives in a single active NameNode (a JVM process), and namespace operations are coordinated through the FSNamesystem global lock: reads can share it, but any write holds it exclusively and stalls everything queued behind it. As concurrent users grow (interactive BI, dozens of simultaneous queries), the NameNode's RPC queue becomes the bottleneck, and GC pauses on a large heap turn directly into p99 latency spikes.
Object storage has no such single serialization point by design — metadata is sharded across all nodes and scales horizontally. This is where the gap widens most on identical hardware, and it grows worse the higher the concurrency.
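The effect of a single serialization point versus sharded metadata can be illustrated with a textbook M/M/1 queue. This is a deliberately crude sketch: it treats the metadata service as fully serialized, which overstates the penalty for read-heavy load, and the rates used below are hypothetical.

```python
def mean_response_time_s(arrival_rate: float, service_rate: float, shards: int = 1) -> float:
    """Mean response time of an M/M/1 queue, with load split evenly over `shards` queues.

    arrival_rate and service_rate are requests per second. Each shard is
    assumed to serve requests at the full service_rate independently.
    """
    per_shard_arrivals = arrival_rate / shards
    if per_shard_arrivals >= service_rate:
        return float("inf")  # the queue is saturated and grows without bound
    return 1.0 / (service_rate - per_shard_arrivals)


# Hypothetical load: 90 requests/s against a service that handles 100/s.
central = mean_response_time_s(90, 100)            # one serialized queue
sharded = mean_response_time_s(90, 100, shards=9)  # same load over 9 shards
```

The shape matters more than the numbers: response time in a single queue rises sharply as utilization approaches 100%, while spreading the same load over several shards keeps each one far from saturation.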
2. The "locality tax" Impala pays
Impala is a locality-aware engine: to place scan work on the node holding the data, it queries the NameNode for block locations. But in a disaggregated deployment that location info buys you essentially nothing — you read everything over the network anyway.
So Impala issues extra, useless NameNode lookups, and that load feeds right back into reason #1. It's not just that its favorite weapon (locality) is neutralized; the very attempt to swing it becomes a cost. Against object storage Trino has no block locations to ask for: it spreads splits across workers without a locality lookup and pays no such tax.
3. 3x replication hotspots vs erasure-coded striping
Classic HDFS stores each block whole on three DataNodes and reads from one replica (= one DataNode). A single block read is therefore bounded by one DataNode's disk and NIC.
MinIO erasure-codes an object into shards spread across many drives and nodes, so reading one object streams from several devices in parallel. Same flash, different degree of single-object parallelism.
HDFS has supported erasure coding since Hadoop 3, but it's rarely used for hot data and carries its own reconstruction-read cost. Most production clusters still run 3x replication.
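A simple ceiling calculation shows why the number of devices behind a single read matters. The sketch below assumes each serving node contributes a fixed bandwidth and that the client NIC caps the total; all parameter names and values are hypothetical, and erasure-coding decode cost is ignored.

```python
def object_read_seconds(size_gb: float, per_node_gbps: float,
                        nodes_serving: int, client_nic_gbps: float) -> float:
    """Best-case time to read one object, in seconds.

    size_gb is in gigabytes; bandwidths are in gigabits per second.
    The read is limited by whichever is smaller: the combined bandwidth
    of the nodes serving it, or the client's own NIC.
    """
    effective_gbps = min(per_node_gbps * nodes_serving, client_nic_gbps)
    return size_gb * 8 / effective_gbps


# Hypothetical 1 GB read, 10 Gbit/s per storage node, 100 Gbit/s client NIC.
replica_read = object_read_seconds(1, 10, nodes_serving=1, client_nic_gbps=100)
striped_read = object_read_seconds(1, 10, nodes_serving=8, client_nic_gbps=100)
```

With one replica the single node is the limit; with eight shards the limit moves toward the client NIC, which is the "different degree of single-object parallelism" described above.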
4. Small files / metadata scalability
As a rule of thumb, HDFS consumes roughly 150 bytes of NameNode heap per namespace object (each file, directory, and block counts as one). Millions of small files crush NameNode memory and slow listing. In object storage a small object is just a key-value entry in a distributed index, with no central heap pressure.
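The 150-byte rule of thumb turns into a quick heap estimate. This sketch counts one namespace object per file plus one per block and ignores directories and JVM overhead, so it is a floor rather than a sizing guide; the function name and inputs are illustrative.

```python
BYTES_PER_NAMESPACE_OBJECT = 150  # rule-of-thumb figure quoted in the text


def namenode_heap_gib(num_files: int, avg_blocks_per_file: float = 1.0) -> float:
    """Rough NameNode heap, in GiB, needed just to hold file and block objects."""
    namespace_objects = num_files * (1 + avg_blocks_per_file)
    return namespace_objects * BYTES_PER_NAMESPACE_OBJECT / 2**30


# Hypothetical cluster: 100 million small files, one block each.
heap = namenode_heap_gib(100_000_000)
```

Under these assumptions 100 million single-block files need about 28 GiB of heap before any other NameNode state is counted, all of it on one JVM.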
Iceberg fetches the file list from manifests instead of a directory LIST, which relieves listing storms on both sides. But the open/stat to actually open a file still hits the NameNode on HDFS. The more small files, the more it folds back into reason #1.
5. A Range-GET I/O stack tailor-made for columnar formats
Modern connectors like Trino's native S3 filesystem are deeply optimized for async parallel range GETs, coalescing of adjacent small reads, prefetch, and footer/metadata caching. Parquet's "read footer → read multiple column chunks in parallel" pattern maps 1:1 onto HTTP Range GETs.
HDFS's DFSClient operates over a block abstraction with a NameNode dependency, leaving relatively little room to optimize these large-scale parallel range reads.
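Coalescing adjacent small reads is the easiest of these optimizations to show in code. The following is a minimal sketch of the general technique, not Trino's actual implementation; the function names and the `max_gap` threshold are assumptions.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]  # (start, end_exclusive)


def coalesce_ranges(ranges: List[ByteRange], max_gap: int) -> List[ByteRange]:
    """Merge byte ranges whose gap is at most max_gap bytes.

    Reading a few unneeded bytes in the gap is often cheaper than
    paying another request round-trip.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def range_header(start: int, end_exclusive: int) -> str:
    """Format an HTTP Range header value; HTTP byte ranges are inclusive."""
    return f"bytes={start}-{end_exclusive - 1}"


# Three hypothetical column-chunk reads; the first two are 50 bytes apart.
chunks = [(0, 100), (150, 300), (10_000, 10_500)]
requests = [range_header(s, e) for s, e in coalesce_ranges(chunks, max_gap=64)]
```

Here three reads collapse into two requests, `bytes=0-299` and `bytes=10000-10499`. The right `max_gap` depends on request latency versus bandwidth and would need tuning per environment.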
6. Round-trips during query planning
Building splits requires file size and location. Iceberg manifests already carry file sizes and statistics, so on the S3 path the storage round-trips needed for planning are nearly zero.
Impala, by contrast, tends to query the NameNode for block locations during planning too (for locality scheduling, per reason #2). That raises planning latency and NameNode load — and the shorter the query, the larger this fixed cost looms.
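The storage-free planning on the S3 path can be sketched as follows: given only the path and size of each data file, as recorded in table metadata, splits are computed with no call to storage. The `Split` type, `plan_splits`, and the fixed-size splitting rule are hypothetical simplifications; real engines also align splits to row-group boundaries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Split:
    path: str
    offset: int
    length: int


def plan_splits(data_files: Iterable[Tuple[str, int]], target_split_bytes: int) -> List[Split]:
    """Cut each file into fixed-size splits using only (path, size) pairs.

    No storage round-trip is needed because the sizes come from
    metadata the planner has already read.
    """
    if target_split_bytes < 1:
        raise ValueError("target_split_bytes must be >= 1")
    splits: List[Split] = []
    for path, size in data_files:
        offset = 0
        while offset < size:
            length = min(target_split_bytes, size - offset)
            splits.append(Split(path, offset, length))
            offset += length
    return splits


# Hypothetical manifest entries: file path and size in bytes.
planned = plan_splits([("a.parquet", 300), ("b.parquet", 100)], target_split_bytes=128)
```

A locality-aware planner would add one more input per file, the block locations, and that input is the piece that costs a NameNode round-trip.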
7. Serialization under concurrent access to hot files
When many queries open the same popular file at once (a small dimension table, a frequently joined file), HDFS funnels traffic into the few DataNodes holding that block plus the NameNode lock, creating a hotspot. Object storage spreads shards across nodes and has no central metadata lock, so concurrent reads distribute more evenly.
One thing to state honestly — the engine confounder
Reclassify the seven causes by their source:
| Cause | Nature |
|---|---|
| 1 · 3 · 4 · 7 | Pure storage layer (HDFS architecture vs object-storage architecture) |
| 2 · 6 | How the engine uses storage (Impala's locality design) |
| 5 | Connector I/O stack (maturity of Trino's native S3 filesystem) |
In other words, the "Trino + MinIO is faster" report still mixes in the Impala vs Trino engine variable. An honest conclusion can't hide that.
To cleanly isolate the storage effect, the ideal controlled experiment is to fix the engine to Trino and change only the storage.
Measured this way, you can see how much the storage side contributes (causes 1·3·4·7, together with the connector difference in cause 5, which necessarily changes along with the storage) and how much was the engine's share. If you can pull this data from a customer environment, the answer to "why did it get faster" becomes far more convincing.
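Once the three runs exist, the attribution is simple subtraction. The sketch below assumes the same query and data were measured on Impala + HDFS, Trino + HDFS, and Trino + MinIO; the function name is hypothetical and the latencies in the example are placeholders, not results.

```python
from typing import Dict


def attribute_speedup(impala_hdfs_s: float, trino_hdfs_s: float, trino_minio_s: float) -> Dict[str, float]:
    """Split the total latency gap into an engine part and a storage part.

    engine_s:  Impala + HDFS minus Trino + HDFS (storage held fixed)
    storage_s: Trino + HDFS minus Trino + MinIO (engine held fixed; this
               delta also includes the HDFS-vs-S3 connector difference)
    """
    total = impala_hdfs_s - trino_minio_s
    engine = impala_hdfs_s - trino_hdfs_s
    storage = trino_hdfs_s - trino_minio_s
    share = storage / total if total else 0.0
    return {"total_s": total, "engine_s": engine, "storage_s": storage, "storage_share": share}


# Placeholder latencies in seconds, for illustration only.
breakdown = attribute_speedup(impala_hdfs_s=10.0, trino_hdfs_s=7.0, trino_minio_s=4.0)
```

Repeating the measurement at several concurrency levels is worthwhile, since the storage share is expected to grow as NameNode contention increases.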
When HDFS is not at a disadvantage
To keep things balanced and avoid overstatement: the analysis above holds for Iceberg analytical queries on disaggregated flash. The gap can shrink or even reverse in these cases:
- Tiny reads dominated by per-request latency: workloads where the per-object HTTP round-trip overhead accumulates.
- Co-located deployments where locality is genuinely alive: the traditional setup with compute on the same nodes as HDFS DataNodes and short-circuit local reads working.
- Sequential scans of a few enormous files with little metadata overhead: cases where NameNode round-trips are a negligible fraction.
So the takeaway is not "HDFS is dead," but rather: in disaggregated, flash-backed, high-concurrency analytics, the HDFS access path's legacy costs no longer earn their keep.
Conclusion
With identical flash, identical network, and identical Iceberg, object storage is faster not because the media is faster — it's because of the access-path architecture.
- (a) Object storage eliminates the central NameNode metadata / global-lock bottleneck,
- (b) it structurally fits the Range-GET I/O pattern of columnar analytics, and
- (c) it parallelizes a single object across more devices.
The HDFS path, meanwhile, carries legacy costs that buy nothing in a disaggregated flash setup — the NameNode hop, the locality tax, replication hotspots. Add Trino's mature Iceberg and S3 connectors on top, and the perceived gap widens.
Next time you get a report that "object storage is faster than HDFS," before debating media or locality, ask first: "how many times does a single read have to go through the NameNode?" Most of the answer lives right there.