Blog
trinofederationconnectorpushdowndata-platform

Trino Federation in Practice — Joining Heterogeneous Sources with a Single SQL Query

A hands-on look at federation, Trino's defining capability. We cover configuring the Hive, Iceberg, PostgreSQL, MySQL, and Kafka connectors, joining across catalogs, how predicate/aggregate pushdown works and where it breaks down, and patterns for keeping federated queries fast.

Data DynamicsJune 5, 20268 min read

If you had to name the single feature that sets Trino apart from other SQL engines, it would be federation. You can join Iceberg tables in object storage, PostgreSQL in your operational database, legacy MySQL, and real-time Kafka topics in one SQL query. You analyze data "where it lives" instead of consolidating it with ETL.

This post covers federation from a practitioner's perspective — how to wire up connectors, how cross-catalog joins work, and most importantly, how pushdown works and where it breaks down, so you can avoid slow federated queries.

1. The Mental Model — Catalog = Data Source

Trino's naming scheme has three levels: catalog.schema.table. One catalog corresponds to one data source.

iceberg.analytics.events       → Iceberg table in object storage
postgresql.crm.customers       → operational PostgreSQL
mysql.legacy.orders            → legacy MySQL
kafka.streams.clickstream      → Kafka topic

Each catalog is defined by a single etc/catalog/<name>.properties file. Simply adding a file makes a new data source appear on the SQL surface.

2. Connector Configuration — Properties per Source

Iceberg (Object Storage Lakehouse)

# etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181
fs.native-s3.enabled=true
s3.endpoint=https://s3.ap-northeast-2.amazonaws.com

Hive (Existing Metastore + HDFS/S3)

# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore:9083
fs.native-s3.enabled=true

PostgreSQL / MySQL (RDBMS)

# etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://pg-host:5432/crm
connection-user=trino
connection-password=${ENV:PG_PASSWORD}
# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://mysql-host:3306
connection-user=trino
connection-password=${ENV:MYSQL_PASSWORD}

Kafka (Real-Time Streams)

# etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=broker1:9092,broker2:9092
kafka.table-names=streams.clickstream
kafka.hide-internal-columns=false
Connector familyRepresentativesCharacteristics
LakehouseIceberg, Delta, Hive, HudiLarge-scale analytics, manifest-based pruning
RDBMSPostgreSQL, MySQL, Oracle, SQL ServerStrong pushdown; watch out for load on operational DBs
NoSQLMongoDB, CassandraSchema mapping
Search/LogsElasticsearch, OpenSearchText and logs
StreamsKafkaReal-time topics as tables
WarehouseBigQuery, Snowflake, RedshiftCloud DW federation

3. Joining Across Catalogs

Once the catalogs are attached, you simply join them in a single query.

-- Object storage events × operational DB customers × legacy orders
SELECT
  c.tier,
  count(DISTINCT e.user_id) AS active_users,
  sum(o.amount)             AS revenue
FROM iceberg.analytics.events e
JOIN postgresql.crm.customers c ON e.user_id = c.id
JOIN mysql.legacy.orders     o ON o.user_id = c.id
WHERE e.event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
  AND c.status = 'active'
GROUP BY c.tier;

Trino pulls data from each source and performs the join and aggregation on its own workers. You get real-time cross-source analysis without ETL pipelines moving data around.

Moving Data Between Sources (CTAS / INSERT)

Federation isn't just for reads — it also handles loading data across sources.

-- Snapshot an operational DB into the Lakehouse
CREATE TABLE iceberg.analytics.customer_snapshot AS
SELECT * FROM postgresql.crm.customers;
 
-- Incremental load: legacy MySQL → Iceberg
INSERT INTO iceberg.analytics.orders
SELECT * FROM mysql.legacy.orders
WHERE updated_at >= current_date - INTERVAL '1' DAY;

This alone gives you a lightweight "DB → Lakehouse" ETL. (Real-time ingestion with tools like NiFi is covered in a separate post, "Real-Time Ingestion into Trino Iceberg Tables with NiFi.")

4. Pushdown — The Key to Federation Performance

If one concept determines federated query performance, it is pushdown: pushing WHERE filters, aggregations, sorts, and LIMIT down to the source, so that only pre-reduced data travels to Trino.

[no pushdown]   all 100M rows in the RDBMS ──network──> Trino filters them  (slow)
[with pushdown] RDBMS applies WHERE, sends 10K rows ──network──> Trino      (fast)
Pushdown typeWhat gets pushed downEffect
Predicate pushdownWHERE conditionsRow filtering at the source
Projection pushdownOnly the required columnsFewer columns transferred
Aggregate pushdowncount, sum, GROUP BYAggregation at the source
Limit / TopN pushdownLIMIT, ORDER BY ... LIMITTop N at the source
Join pushdownJoins within the same sourceJoin at the source (some connectors)

This is especially powerful with RDBMS connectors — Trino translates WHERE clauses and aggregations into SQL and delegates them to the origin database, so the database's indexes get used too.

Verifying Pushdown

EXPLAIN
SELECT count(*) FROM postgresql.crm.customers WHERE status = 'active';

In the plan, check whether the conditions and aggregations are attached to the TableScan node, meaning they were pushed down to the source. When an aggregation is pushed down, there is no separate Aggregate node on the Trino side — Trino simply receives the result the source has already aggregated.

5. When Pushdown Breaks Down

When pushdown breaks down, Trino has to pull the entire dataset from the source and process it itself, making queries drastically slower. Common causes:

PatternResultAlternative
Wrapping a column in a function in WHERE (LOWER(name)='x')Predicate pushdown breaksRewrite in a form the source understands, or apply transformations to the constant side
Using Trino-only functions in WHERESource doesn't know the function → can't push downReplace with standard comparisons
Filter in a cross-source join exists on only one sideThe other source gets pulled in wholesaleFilter both sides, leverage dynamic filtering
Connector doesn't support that aggregationAggregate pushdown failsPre-aggregated table
-- BAD: not pushed down to PostgreSQL → pulls everything
WHERE date_trunc('day', created_at) = DATE '2026-06-01'
 
-- GOOD: range comparison → pushdown
WHERE created_at >= TIMESTAMP '2026-06-01 00:00:00'
  AND created_at <  TIMESTAMP '2026-06-02 00:00:00'

6. Federated Joins Between Large Tables

When you join two large tables from different sources, even pushdown has its limits — both sides must be pulled into Trino and shuffled.

Strategies:

  • Broadcast the smaller side: if the dimension table is small enough, replicate it to every worker (the CBO picks this automatically when statistics are available). (Join distribution is covered in a separate post, "A Deep Dive into the Trino Cost-Based Optimizer.")
  • Dynamic Filtering: build a dynamic filter from the smaller (build) side's results to reduce the scan on the larger (probe) side. This works in federation too.
  • Replicate frequently joined large operational DB tables into the Lakehouse: instead of hitting the operational DB every time, land a nightly snapshot in Iceberg and join against that. It also reduces load on the operational DB.

Principle: federation shines when you "need to pull the latest operational data in real time." If you join the same large table over and over, that's a signal for loading it into the Lakehouse (ELT) rather than federating.

7. Cautions When Federating Operational Databases

RDBMS connectors are powerful, but they can put load on operational databases.

  • Point the connector at a read replica whenever possible, so analytics queries don't compete for resources with operational transactions.
  • Combine the Trino-side query.max-scan-physical-bytes guardrail with a statement timeout on the source DB, so a large full scan can't paralyze the operational database.
  • Limit the connector's concurrent connections (connection pool) so you don't exhaust the operational DB's connections.

8. Federation Performance Checklist

  • Verify with EXPLAIN that WHERE clauses and aggregations are pushed down to the source
  • Remove function wrapping of columns in WHERE clauses (use range comparisons)
  • Collect statistics (ANALYZE) on large dimension tables → broadcast is chosen automatically
  • Consider replicating repeatedly joined operational DB tables into the Lakehouse
  • Connect operational DBs via read replicas, limit connection pools
  • Confirm dynamic filtering is active (dynamicFilter in EXPLAIN)
  • Set scan guardrails and source-side timeouts

9. Summary

ConceptKey point
Catalog1 data source = 1 properties file
JoinsJust join catalog.schema.table
Cross-source loadingLightweight ELT with CTAS/INSERT
PushdownThe key to federation performance — filters and aggregations at the source
Avoiding breakageNo function wrapping; use standard comparisons
Large joinsBroadcast and dynamic filtering; if repeated, load into the Lakehouse
Operational DBsRead replicas, load and connection guardrails

Federation is Trino's most compelling feature, but its performance depends entirely on how much pushdown actually happens. The habit of checking with EXPLAIN whether conditions were pushed down to the source, writing SQL that avoids function wrapping, and the judgment that "a repeated large join is a signal to load, not federate" — these three are the keys to running federation fast in practice.


This post is based on the Trino 440 series. If you need help with unified analytics across heterogeneous data sources or designing a federation/ELT architecture, feel free to reach out.

— Data Dynamics Engineering Team