trinofederationconnectorpushdowndata-platform

Trino Federation in Practice — Joining Heterogeneous Sources with a Single SQL Query

A hands-on look at federation, Trino's defining capability. We cover configuring the Hive, Iceberg, PostgreSQL, MySQL, and Kafka connectors, joining across catalogs, how predicate/aggregate pushdown works and where it breaks down, and patterns for keeping federated queries fast.

Data DynamicsJune 5, 20268 min read

If you had to name the single feature that sets Trino apart from other SQL engines, it would be federation. You can join Iceberg tables in object storage, PostgreSQL in your operational database, legacy MySQL, and real-time Kafka topics in one SQL query. You analyze data "where it lives" instead of consolidating it with ETL.

This post covers federation from a practitioner's perspective — how to wire up connectors, how cross-catalog joins work, and most importantly, how pushdown works and where it breaks down, so you can avoid slow federated queries.

1. The Mental Model — Catalog = Data Source

Trino's naming scheme has three levels: catalog.schema.table. One catalog corresponds to one data source.

Loading diagram…

Each catalog is defined by a single etc/catalog/<name>.properties file. Simply adding a file makes a new data source appear on the SQL surface.

2. Connector Configuration — Properties per Source

Iceberg (Object Storage Lakehouse)

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181
fs.native-s3.enabled=true
s3.endpoint=https://s3.ap-northeast-2.amazonaws.com

Hive (Existing Metastore + HDFS/S3)

# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore:9083
fs.native-s3.enabled=true

PostgreSQL / MySQL (RDBMS)

# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://pg-host:5432/crm
connection-user=trino
connection-password=${ENV:PG_PASSWORD}

# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql-host:3306
connection-user=trino
connection-password=${ENV:MYSQL_PASSWORD}

Kafka (Real-Time Streams)

# etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=broker1:9092,broker2:9092
kafka.table-names=streams.clickstream
kafka.hide-internal-columns=false

Connector family	Representatives	Characteristics
Lakehouse	Iceberg, Delta, Hive, Hudi	Large-scale analytics, manifest-based pruning
RDBMS	PostgreSQL, MySQL, Oracle, SQL Server	Strong pushdown; watch out for load on operational DBs
NoSQL	MongoDB, Cassandra	Schema mapping
Search/Logs	Elasticsearch, OpenSearch	Text and logs
Streams	Kafka	Real-time topics as tables
Warehouse	BigQuery, Snowflake, Redshift	Cloud DW federation

3. Joining Across Catalogs

Once the catalogs are attached, you simply join them in a single query.

-- Object storage events × operational DB customers × legacy orders
SELECT
  c.tier,
  count(DISTINCT e.user_id) AS active_users,
  sum(o.amount)             AS revenue
FROM iceberg.analytics.events e
JOIN postgresql.crm.customers c ON e.user_id = c.id
JOIN mysql.legacy.orders     o ON o.user_id = c.id
WHERE e.event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
  AND c.status = 'active'
GROUP BY c.tier;

Trino pulls data from each source and performs the join and aggregation on its own workers. You get real-time cross-source analysis without ETL pipelines moving data around.

Moving Data Between Sources (CTAS / INSERT)

Federation isn't just for reads — it also handles loading data across sources.

-- Snapshot an operational DB into the Lakehouse
CREATE TABLE iceberg.analytics.customer_snapshot AS
SELECT * FROM postgresql.crm.customers;
 
-- Incremental load: legacy MySQL → Iceberg
INSERT INTO iceberg.analytics.orders
SELECT * FROM mysql.legacy.orders
WHERE updated_at >= current_date - INTERVAL '1' DAY;

This alone gives you a lightweight "DB → Lakehouse" ETL. (Real-time ingestion with tools like NiFi is covered in a separate post, "Real-Time Ingestion into Trino Iceberg Tables with NiFi.")

4. Pushdown — The Key to Federation Performance

If one concept determines federated query performance, it is pushdown: pushing WHERE filters, aggregations, sorts, and LIMIT down to the source, so that only pre-reduced data travels to Trino.

Loading diagram…

Pushdown type	What gets pushed down	Effect
Predicate pushdown	`WHERE` conditions	Row filtering at the source
Projection pushdown	Only the required columns	Fewer columns transferred
Aggregate pushdown	`count`, `sum`, `GROUP BY`	Aggregation at the source
Limit / TopN pushdown	`LIMIT`, `ORDER BY ... LIMIT`	Top N at the source
Join pushdown	Joins within the same source	Join at the source (some connectors)

This is especially powerful with RDBMS connectors — Trino translates WHERE clauses and aggregations into SQL and delegates them to the origin database, so the database's indexes get used too.

Verifying Pushdown

EXPLAIN
SELECT count(*) FROM postgresql.crm.customers WHERE status = 'active';

In the plan, check whether the conditions and aggregations are attached to the TableScan node, meaning they were pushed down to the source. When an aggregation is pushed down, there is no separate Aggregate node on the Trino side — Trino simply receives the result the source has already aggregated.

5. When Pushdown Breaks Down

When pushdown breaks down, Trino has to pull the entire dataset from the source and process it itself, making queries drastically slower. Common causes:

Pattern	Result	Alternative
Wrapping a column in a function in WHERE (`LOWER(name)='x'`)	Predicate pushdown breaks	Rewrite in a form the source understands, or apply transformations to the constant side
Using Trino-only functions in WHERE	Source doesn't know the function → can't push down	Replace with standard comparisons
Filter in a cross-source join exists on only one side	The other source gets pulled in wholesale	Filter both sides, leverage dynamic filtering
Connector doesn't support that aggregation	Aggregate pushdown fails	Pre-aggregated table

-- BAD: not pushed down to PostgreSQL → pulls everything
WHERE date_trunc('day', created_at) = DATE '2026-06-01'
 
-- GOOD: range comparison → pushdown
WHERE created_at >= TIMESTAMP '2026-06-01 00:00:00'
  AND created_at <  TIMESTAMP '2026-06-02 00:00:00'

6. Federated Joins Between Large Tables

When you join two large tables from different sources, even pushdown has its limits — both sides must be pulled into Trino and shuffled.

Strategies:

Broadcast the smaller side: if the dimension table is small enough, replicate it to every worker (the CBO picks this automatically when statistics are available). (Join distribution is covered in a separate post, "A Deep Dive into the Trino Cost-Based Optimizer.")
Dynamic Filtering: build a dynamic filter from the smaller (build) side's results to reduce the scan on the larger (probe) side. This works in federation too.
Replicate frequently joined large operational DB tables into the Lakehouse: instead of hitting the operational DB every time, land a nightly snapshot in Iceberg and join against that. It also reduces load on the operational DB.

Principle: federation shines when you "need to pull the latest operational data in real time." If you join the same large table over and over, that's a signal for loading it into the Lakehouse (ELT) rather than federating.

7. Cautions When Federating Operational Databases

RDBMS connectors are powerful, but they can put load on operational databases.

Point the connector at a read replica whenever possible, so analytics queries don't compete for resources with operational transactions.
Combine the Trino-side query.max-scan-physical-bytes guardrail with a statement timeout on the source DB, so a large full scan can't paralyze the operational database.
Limit the connector's concurrent connections (connection pool) so you don't exhaust the operational DB's connections.

8. Federation Performance Checklist

Verify with EXPLAIN that WHERE clauses and aggregations are pushed down to the source
Remove function wrapping of columns in WHERE clauses (use range comparisons)
Collect statistics (ANALYZE) on large dimension tables → broadcast is chosen automatically
Consider replicating repeatedly joined operational DB tables into the Lakehouse
Connect operational DBs via read replicas, limit connection pools
Confirm dynamic filtering is active (dynamicFilter in EXPLAIN)
Set scan guardrails and source-side timeouts

9. Summary

Concept	Key point
Catalog	1 data source = 1 properties file
Joins	Just join `catalog.schema.table`
Cross-source loading	Lightweight ELT with CTAS/INSERT
Pushdown	The key to federation performance — filters and aggregations at the source
Avoiding breakage	No function wrapping; use standard comparisons
Large joins	Broadcast and dynamic filtering; if repeated, load into the Lakehouse
Operational DBs	Read replicas, load and connection guardrails

Federation is Trino's most compelling feature, but its performance depends entirely on how much pushdown actually happens. The habit of checking with EXPLAIN whether conditions were pushed down to the source, writing SQL that avoids function wrapping, and the judgment that "a repeated large join is a signal to load, not federate" — these three are the keys to running federation fast in practice.

This post is based on the Trino 440 series. If you need help with unified analytics across heterogeneous data sources or designing a federation/ELT architecture, feel free to reach out.

— Data Dynamics Engineering Team