Trino Federation in Practice — Joining Heterogeneous Sources with a Single SQL Query
A hands-on look at federation, Trino's defining capability. We cover configuring the Hive, Iceberg, PostgreSQL, MySQL, and Kafka connectors, joining across catalogs, how predicate/aggregate pushdown works and where it breaks down, and patterns for keeping federated queries fast.
If you had to name the single feature that sets Trino apart from other SQL engines, it would be federation. You can join Iceberg tables in object storage, PostgreSQL in your operational database, legacy MySQL, and real-time Kafka topics in one SQL query. You analyze data "where it lives" instead of consolidating it with ETL.
This post covers federation from a practitioner's perspective — how to wire up connectors, how cross-catalog joins work, and most importantly, how pushdown works and where it breaks down, so you can avoid slow federated queries.
1. The Mental Model — Catalog = Data Source
Trino's naming scheme has three levels: catalog.schema.table. One catalog corresponds to one data source.
iceberg.analytics.events → Iceberg table in object storage
postgresql.crm.customers → operational PostgreSQL
mysql.legacy.orders → legacy MySQL
kafka.streams.clickstream → Kafka topicEach catalog is defined by a single etc/catalog/<name>.properties file. Simply adding a file makes a new data source appear on the SQL surface.
2. Connector Configuration — Properties per Source
Iceberg (Object Storage Lakehouse)
# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181
fs.native-s3.enabled=true
s3.endpoint=https://s3.ap-northeast-2.amazonaws.comHive (Existing Metastore + HDFS/S3)
# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore:9083
fs.native-s3.enabled=truePostgreSQL / MySQL (RDBMS)
# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://pg-host:5432/crm
connection-user=trino
connection-password=${ENV:PG_PASSWORD}# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql-host:3306
connection-user=trino
connection-password=${ENV:MYSQL_PASSWORD}Kafka (Real-Time Streams)
# etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=broker1:9092,broker2:9092
kafka.table-names=streams.clickstream
kafka.hide-internal-columns=false| Connector family | Representatives | Characteristics |
|---|---|---|
| Lakehouse | Iceberg, Delta, Hive, Hudi | Large-scale analytics, manifest-based pruning |
| RDBMS | PostgreSQL, MySQL, Oracle, SQL Server | Strong pushdown; watch out for load on operational DBs |
| NoSQL | MongoDB, Cassandra | Schema mapping |
| Search/Logs | Elasticsearch, OpenSearch | Text and logs |
| Streams | Kafka | Real-time topics as tables |
| Warehouse | BigQuery, Snowflake, Redshift | Cloud DW federation |
3. Joining Across Catalogs
Once the catalogs are attached, you simply join them in a single query.
-- Object storage events × operational DB customers × legacy orders
SELECT
c.tier,
count(DISTINCT e.user_id) AS active_users,
sum(o.amount) AS revenue
FROM iceberg.analytics.events e
JOIN postgresql.crm.customers c ON e.user_id = c.id
JOIN mysql.legacy.orders o ON o.user_id = c.id
WHERE e.event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
AND c.status = 'active'
GROUP BY c.tier;Trino pulls data from each source and performs the join and aggregation on its own workers. You get real-time cross-source analysis without ETL pipelines moving data around.
Moving Data Between Sources (CTAS / INSERT)
Federation isn't just for reads — it also handles loading data across sources.
-- Snapshot an operational DB into the Lakehouse
CREATE TABLE iceberg.analytics.customer_snapshot AS
SELECT * FROM postgresql.crm.customers;
-- Incremental load: legacy MySQL → Iceberg
INSERT INTO iceberg.analytics.orders
SELECT * FROM mysql.legacy.orders
WHERE updated_at >= current_date - INTERVAL '1' DAY;This alone gives you a lightweight "DB → Lakehouse" ETL. (Real-time ingestion with tools like NiFi is covered in a separate post, "Real-Time Ingestion into Trino Iceberg Tables with NiFi.")
4. Pushdown — The Key to Federation Performance
If one concept determines federated query performance, it is pushdown: pushing WHERE filters, aggregations, sorts, and LIMIT down to the source, so that only pre-reduced data travels to Trino.
[no pushdown] all 100M rows in the RDBMS ──network──> Trino filters them (slow)
[with pushdown] RDBMS applies WHERE, sends 10K rows ──network──> Trino (fast)| Pushdown type | What gets pushed down | Effect |
|---|---|---|
| Predicate pushdown | WHERE conditions | Row filtering at the source |
| Projection pushdown | Only the required columns | Fewer columns transferred |
| Aggregate pushdown | count, sum, GROUP BY | Aggregation at the source |
| Limit / TopN pushdown | LIMIT, ORDER BY ... LIMIT | Top N at the source |
| Join pushdown | Joins within the same source | Join at the source (some connectors) |
This is especially powerful with RDBMS connectors — Trino translates WHERE clauses and aggregations into SQL and delegates them to the origin database, so the database's indexes get used too.
Verifying Pushdown
EXPLAIN
SELECT count(*) FROM postgresql.crm.customers WHERE status = 'active';In the plan, check whether the conditions and aggregations are attached to the TableScan node, meaning they were pushed down to the source. When an aggregation is pushed down, there is no separate Aggregate node on the Trino side — Trino simply receives the result the source has already aggregated.
5. When Pushdown Breaks Down
When pushdown breaks down, Trino has to pull the entire dataset from the source and process it itself, making queries drastically slower. Common causes:
| Pattern | Result | Alternative |
|---|---|---|
Wrapping a column in a function in WHERE (LOWER(name)='x') | Predicate pushdown breaks | Rewrite in a form the source understands, or apply transformations to the constant side |
| Using Trino-only functions in WHERE | Source doesn't know the function → can't push down | Replace with standard comparisons |
| Filter in a cross-source join exists on only one side | The other source gets pulled in wholesale | Filter both sides, leverage dynamic filtering |
| Connector doesn't support that aggregation | Aggregate pushdown fails | Pre-aggregated table |
-- BAD: not pushed down to PostgreSQL → pulls everything
WHERE date_trunc('day', created_at) = DATE '2026-06-01'
-- GOOD: range comparison → pushdown
WHERE created_at >= TIMESTAMP '2026-06-01 00:00:00'
AND created_at < TIMESTAMP '2026-06-02 00:00:00'6. Federated Joins Between Large Tables
When you join two large tables from different sources, even pushdown has its limits — both sides must be pulled into Trino and shuffled.
Strategies:
- Broadcast the smaller side: if the dimension table is small enough, replicate it to every worker (the CBO picks this automatically when statistics are available). (Join distribution is covered in a separate post, "A Deep Dive into the Trino Cost-Based Optimizer.")
- Dynamic Filtering: build a dynamic filter from the smaller (build) side's results to reduce the scan on the larger (probe) side. This works in federation too.
- Replicate frequently joined large operational DB tables into the Lakehouse: instead of hitting the operational DB every time, land a nightly snapshot in Iceberg and join against that. It also reduces load on the operational DB.
Principle: federation shines when you "need to pull the latest operational data in real time." If you join the same large table over and over, that's a signal for loading it into the Lakehouse (ELT) rather than federating.
7. Cautions When Federating Operational Databases
RDBMS connectors are powerful, but they can put load on operational databases.
- Point the connector at a read replica whenever possible, so analytics queries don't compete for resources with operational transactions.
- Combine the Trino-side
query.max-scan-physical-bytesguardrail with a statement timeout on the source DB, so a large full scan can't paralyze the operational database. - Limit the connector's concurrent connections (connection pool) so you don't exhaust the operational DB's connections.
8. Federation Performance Checklist
- Verify with
EXPLAINthat WHERE clauses and aggregations are pushed down to the source - Remove function wrapping of columns in WHERE clauses (use range comparisons)
- Collect statistics (
ANALYZE) on large dimension tables → broadcast is chosen automatically - Consider replicating repeatedly joined operational DB tables into the Lakehouse
- Connect operational DBs via read replicas, limit connection pools
- Confirm dynamic filtering is active (dynamicFilter in
EXPLAIN) - Set scan guardrails and source-side timeouts
9. Summary
| Concept | Key point |
|---|---|
| Catalog | 1 data source = 1 properties file |
| Joins | Just join catalog.schema.table |
| Cross-source loading | Lightweight ELT with CTAS/INSERT |
| Pushdown | The key to federation performance — filters and aggregations at the source |
| Avoiding breakage | No function wrapping; use standard comparisons |
| Large joins | Broadcast and dynamic filtering; if repeated, load into the Lakehouse |
| Operational DBs | Read replicas, load and connection guardrails |
Federation is Trino's most compelling feature, but its performance depends entirely on how much pushdown actually happens. The habit of checking with EXPLAIN whether conditions were pushed down to the source, writing SQL that avoids function wrapping, and the judgment that "a repeated large join is a signal to load, not federate" — these three are the keys to running federation fast in practice.
This post is based on the Trino 440 series. If you need help with unified analytics across heterogeneous data sources or designing a federation/ELT architecture, feel free to reach out.
— Data Dynamics Engineering Team