Trino Observability — JMX Metrics, the Web UI, and Diagnosing Slow Queries
What should you monitor to run production Trino reliably? This post covers JMX metrics and Prometheus scraping, query analysis with the Web UI, using system.runtime tables, query auditing and history capture with event listeners, and a step-by-step procedure for diagnosing slow queries.
When a Trino cluster slows down or queries start failing, you can only operate it effectively if you can answer "why?" Without observability, you end up chasing user complaints and responding on guesswork. Fortunately, Trino ships with a rich set of observability tools out of the box — JMX metrics, the Web UI, system tables, and event listeners.
This post covers what to monitor, how to collect and visualize it, and the order in which to dig in when you hit a slow query.
1. The Four Layers of Observability
(1) JMX metrics → Time series and alerting via Prometheus/Grafana (cluster health)
(2) Web UI → Real-time query and stage analysis (single-query diagnosis)
(3) system tables → Current state and history via SQL (automation and reporting)
(4) Event listeners → Capture query start/completion events (auditing, long-term analysis)

| Layer | Question it answers | Tool |
|---|---|---|
| JMX | Is the cluster healthy? What are the trends? | Prometheus, Grafana |
| Web UI | Why is this query slow? | Coordinator Web UI |
| system tables | What is running right now? | system.runtime.* |
| Event listeners | Who ran what queries yesterday? | Event listener → external storage |
2. JMX Metrics and Prometheus
Trino exposes its internal state via JMX, and also provides a jmx catalog that lets you query JMX with SQL. In production, the typical setup is scrape with Prometheus → visualize with Grafana → alert with Alertmanager.
The coordinator can export Prometheus-format metrics via the /metrics endpoint (or a JMX exporter).
# Example Prometheus scrape config
scrape_configs:
  - job_name: trino
    metrics_path: /metrics
    static_configs:
      - targets: ['trino-coordinator:8080']

Key Metrics You Must Watch
| Metric area | What to look at | Alert signal |
|---|---|---|
| Query counts | running / queued / blocked | queued keeps growing → resource shortage |
| Query outcomes | completed / failed ratio | failure rate spikes |
| Cluster memory | usage / reservation | approaching the limit → OOM risk |
| Worker node count | active workers | dropping → discovery or node failure |
| CPU | worker CPU utilization | sustained saturation → add capacity / scale |
| GC | GC pause time and frequency | getting longer → heap pressure |
| Exchange spooling (fault-tolerant execution) | exchange I/O | spiking → excessive retries |
We recommend putting queued query count, cluster memory, and worker node count at the top of your dashboard — most incidents show up in these three metrics first.
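To make those three headline signals concrete, here is a minimal Python sketch that parses Prometheus text exposition output and pulls them out. The metric names (`example_queued_queries` and so on) are placeholders chosen for illustration, not names Trino is known to emit; substitute whatever your `/metrics` endpoint or JMX exporter actually exposes.

```python
# Placeholder metric names (assumptions); replace with the names your exporter emits.
HEADLINE_METRICS = {
    "queued_queries": "example_queued_queries",
    "active_workers": "example_active_workers",
    "memory_free_bytes": "example_memory_free_bytes",
}


def parse_metrics(text):
    """Parse Prometheus text exposition format into {metric_name: [values]}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        if "}" in line:
            # Labelled sample: name{labels} value [timestamp]
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            # Unlabelled sample: name value [timestamp]
            parts = line.split(None, 1)
            name = parts[0]
            tail = parts[1] if len(parts) > 1 else ""
        fields = tail.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore lines that do not carry a numeric value
        samples.setdefault(name, []).append(value)
    return samples


def headline(samples):
    """Sum each headline metric across its label sets; None when it is absent."""
    return {
        label: (sum(samples[name]) if name in samples else None)
        for label, name in HEADLINE_METRICS.items()
    }


SAMPLE = """\
# TYPE example_queued_queries gauge
example_queued_queries 12
example_active_workers 7
example_memory_free_bytes{pool="general"} 2.5e9
"""

print(headline(parse_metrics(SAMPLE)))
```

In practice Prometheus does this parsing for you; a script like this is mainly useful for a quick sanity check of what the endpoint returns before you build the dashboard.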
Ad-hoc Queries with the jmx Catalog
-- Cluster memory pool state via SQL
SELECT node, freebytes, maxbytes
FROM jmx.current."trino.memory:name=general,type=memorypool";

3. The Web UI — the Core Tool for Single-Query Diagnosis
The coordinator Web UI (port 8080/8443 by default) provides a real-time query list and detailed analysis for each query. It is by far the most powerful tool when digging into a single slow query.
What to look at in Query Detail:
| Item | Meaning | Diagnosis |
|---|---|---|
| Peak Memory | Maximum memory used by the query | Near the limit → top tuning priority |
| Per-stage time | Which stage takes the longest | Identify the bottleneck stage |
| Input/Output rows | Where data explodes | Join blowup |
| Spilled Data | Whether and how much spill occurred | Sign of memory pressure |
| Splits (scheduling) | Split distribution across workers | Data skew |
Diagnosing data skew: if a single worker's split count and processing time are conspicuously high, data is piling up on a specific key. Suspect the join key distribution, and revisit your partition/bucket design if needed.
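One way to turn "conspicuously high" into a number is to compare each worker against the median worker. The sketch below assumes you have copied per-worker processing times for one stage out of the Web UI; the worker names, the sample values, and the 2x threshold are all illustrative assumptions, not Trino defaults.

```python
from statistics import median


def skew_report(per_worker, threshold=2.0):
    """Flag workers whose value exceeds `threshold` times the median worker."""
    if not per_worker:
        return {"median": 0.0, "skewed": {}}
    mid = median(per_worker.values())
    if mid <= 0:
        return {"median": float(mid), "skewed": {}}
    skewed = {
        worker: value / mid
        for worker, value in per_worker.items()
        if value / mid > threshold
    }
    return {"median": float(mid), "skewed": skewed}


# Hypothetical per-worker processing time (seconds) for a single stage.
stage_seconds = {
    "worker-1": 41.0,
    "worker-2": 39.5,
    "worker-3": 44.0,
    "worker-4": 212.0,
}

report = skew_report(stage_seconds)
print(report["median"], report["skewed"])
```

The median is used rather than the mean so that the one overloaded worker does not drag the baseline up and hide itself.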
4. system Tables — Querying Cluster State with SQL
The system.runtime schema lets you query cluster state with SQL, which makes it useful for automation and recurring reports.
-- Queries currently running/queued (oldest first; per-query memory is shown in the Web UI, not in this table)
SELECT query_id, state, user, resource_group_id,
       created, started, query
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
ORDER BY created;
-- Active worker nodes
SELECT node_id, http_uri, node_version, state
FROM system.runtime.nodes;
-- Find long-running queries (e.g. over 10 minutes)
SELECT query_id, user,
       date_diff('second', created, now()) AS elapsed_seconds, query
FROM system.runtime.queries
WHERE state = 'RUNNING' AND created < now() - INTERVAL '10' MINUTE
ORDER BY created;

-- Forcibly terminate a runaway query
CALL system.runtime.kill_query(query_id => '20260605_120000_00001_abcde',
    message => 'admin killed: runaway scan');

5. Event Listeners — Query Auditing and Long-Term Analysis
The Web UI and system tables keep only recent history: it disappears when the coordinator restarts and is trimmed by retention limits. For long-term analysis and auditing, use an event listener to ship query start/completion events to external storage.
# etc/event-listener.properties
# Set to the listener plugin name, e.g. HTTP, Kafka, or a custom plugin.
# Keep comments on their own line: properties files do not support trailing comments.
event-listener.name=...

Events carry the query text, user, source, execution time, scanned bytes, memory, success/failure, error codes, and more. If you receive them via Kafka/HTTP and load them into an Iceberg table, you can run analyses like the following, using Trino itself.
-- Top 20 most expensive (highest scan volume) queries yesterday
SELECT user, scanned_bytes / 1e9 AS scanned_gb, elapsed_ms, query
FROM iceberg.audit.query_log
WHERE event_date = current_date - INTERVAL '1' DAY
ORDER BY scanned_bytes DESC
LIMIT 20;
-- Failure rate by user over the last 7 days
SELECT user, count(*) AS total,
count_if(state = 'FAILED') AS failed,
count_if(state = 'FAILED') * 1.0 / count(*) AS fail_rate
FROM iceberg.audit.query_log
WHERE event_date >= current_date - INTERVAL '7' DAY
GROUP BY user
ORDER BY fail_rate DESC;

This query log becomes the foundation for security auditing, cost analysis (who consumes the most resources), and data-driven Resource Group policy tuning.
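The cost-analysis use can also be sketched outside SQL, for example in a scheduled reporting job. The function below assumes rows shaped like the `query_log` columns used in the queries above (`user`, `state`, `scanned_bytes`, `elapsed_ms`); the sample rows are made up for illustration.

```python
from collections import defaultdict


def summarize_by_user(rows):
    """Aggregate query-log rows into one cost summary per user, heaviest scanner first."""
    totals = defaultdict(
        lambda: {"queries": 0, "failed": 0, "scanned_bytes": 0, "elapsed_ms": 0}
    )
    for row in rows:
        t = totals[row["user"]]
        t["queries"] += 1
        t["failed"] += 1 if row["state"] == "FAILED" else 0
        t["scanned_bytes"] += row["scanned_bytes"]
        t["elapsed_ms"] += row["elapsed_ms"]

    summary = [
        {
            "user": user,
            "queries": t["queries"],
            "fail_rate": t["failed"] / t["queries"],
            "scanned_gb": t["scanned_bytes"] / 1e9,
            "elapsed_minutes": t["elapsed_ms"] / 60000,
        }
        for user, t in totals.items()
    ]
    return sorted(summary, key=lambda s: s["scanned_gb"], reverse=True)


# Hypothetical rows, as they might come back from the audit table.
rows = [
    {"user": "etl", "state": "FINISHED", "scanned_bytes": 400_000_000_000, "elapsed_ms": 540_000},
    {"user": "etl", "state": "FAILED", "scanned_bytes": 100_000_000_000, "elapsed_ms": 60_000},
    {"user": "alice", "state": "FINISHED", "scanned_bytes": 2_000_000_000, "elapsed_ms": 30_000},
]

for line in summarize_by_user(rows):
    print(line)
```

A per-user summary like this is one possible input when deciding which Resource Group a workload belongs in; the right cost measure (scan volume, elapsed time, or something else) depends on what is scarce in your cluster.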
6. A Procedure for Diagnosing Slow Queries
When you hit a symptom, narrow it down step by step.
1. Is it the whole cluster, or just this query? → JMX dashboard (queued / memory / worker count)
│ Cluster-wide → resource shortage or node failure (scale out / check discovery)
│ Single query ↓
2. Which stage is the bottleneck? → Web UI Query Detail (per-stage time)
│
3. Do estimated vs actual row counts diverge? → EXPLAIN ANALYZE (stats problem → ANALYZE)
│
4. Is it data skew? → Web UI splits (variance across workers)
│
5. Is spill kicking in? → Web UI Spilled Data (memory pressure)
│
6. Did pushdown/pruning break? → EXPLAIN (function wrapping, statistics)

(For EXPLAIN and statistics in depth, see the separate post "A Deep Dive into the Trino Cost-Based Optimizer"; for memory and spill, see "Trino Memory Management and Resource Groups".)
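The procedure above can be sketched as a small decision function, which is handy as a runbook checklist. Every field on `Observations` is a hypothetical value an operator would read off the dashboard, the Web UI, or EXPLAIN output by hand, and the two limits are illustrative assumptions rather than recommended values.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Hand-collected inputs for one slow query (all fields are illustrative)."""
    cluster_wide: bool            # step 1: are queued/memory/worker metrics off for everyone?
    bottleneck_stage: str         # step 2: slowest stage in Query Detail
    actual_over_estimated: float  # step 3: actual rows / estimated rows from EXPLAIN ANALYZE
    worker_skew_ratio: float      # step 4: slowest worker relative to the median worker
    spilled_bytes: int            # step 5: Spilled Data in the Web UI
    pushdown_applied: bool        # step 6: does EXPLAIN still show pushdown/pruning?


def triage(obs, stats_limit=10.0, skew_limit=2.0):
    """Walk the six steps in order and return a list of findings."""
    if obs.cluster_wide:
        return ["cluster-wide: check capacity, worker count, and discovery"]

    findings = [f"bottleneck stage: {obs.bottleneck_stage}"]
    ratio = obs.actual_over_estimated
    if ratio > stats_limit or 0 < ratio < 1 / stats_limit:
        findings.append("estimates diverge from actuals: refresh statistics with ANALYZE")
    if obs.worker_skew_ratio > skew_limit:
        findings.append("data skew: review join keys and partition/bucket design")
    if obs.spilled_bytes > 0:
        findings.append("spill active: memory pressure")
    if not obs.pushdown_applied:
        findings.append("pushdown/pruning lost: look for functions wrapped around filter columns")
    return findings


slow_query = Observations(
    cluster_wide=False,
    bottleneck_stage="stage 3",
    actual_over_estimated=250.0,
    worker_skew_ratio=4.8,
    spilled_bytes=0,
    pushdown_applied=True,
)

for finding in triage(slow_query):
    print(finding)
```

The early return on `cluster_wide` mirrors step 1: if the whole cluster is unhealthy, per-query findings are likely symptoms rather than causes.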
7. Quick Reference: Symptom → Metric → Remedy
| Symptom | Metric to check first | Remedy |
|---|---|---|
| Queries piling up in the queue | queued queries, concurrency limits | Adjust Resource Groups, add workers |
| Whole cluster slow | worker CPU, memory, node count | Scale out, recover failed nodes |
| Only a specific query slow | Web UI stage times | Tune the bottleneck stage |
| Intermittent OOM | cluster memory, peak memory | Spill, memory limits, query optimization |
| Worker count fluctuating | active workers, GC | Check discovery, heap, node pool |
| One query hogging resources | runtime.queries | kill_query + Resource Groups |
| Failure rate spiking | failed ratio, error codes | Analyze error patterns via the event log |
8. Recommended Alerting Rules
At minimum, set up the following in Grafana/Alertmanager:
- Active worker count below the expected value for N minutes
- Cluster memory utilization > 85% for N minutes
- Queued query count > threshold for N minutes
- Query failure rate > threshold
- Coordinator health check (/v1/info) failing
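The "for N minutes" qualifier in these rules is what keeps a single bad scrape from paging anyone. Alertmanager-style tooling handles this for you; the sketch below only illustrates the logic, using made-up samples and thresholds, and a `min_samples` guard that is an assumption of this example.

```python
def breached_for(samples, threshold, minutes, now, above=True, min_samples=3):
    """Return True when every sample in the last `minutes` breaches the threshold.

    samples: list of (unix_seconds, value) pairs.
    above:   True to alert on value > threshold, False to alert on value < threshold.
    """
    window_start = now - minutes * 60
    window = [value for ts, value in samples if window_start <= ts <= now]
    if len(window) < min_samples:
        return False  # too little data to call the condition sustained
    if above:
        return all(value > threshold for value in window)
    return all(value < threshold for value in window)


now = 600
# Hypothetical one-minute scrapes: memory utilization and active worker count.
memory_utilization = [(ts, 0.90) for ts in range(300, 601, 60)]
active_workers = [(300, 10), (360, 10), (420, 8), (480, 8), (540, 8), (600, 8)]

memory_alert = breached_for(memory_utilization, 0.85, minutes=5, now=now)
workers_alert_5m = breached_for(active_workers, 10, minutes=5, now=now, above=False)
workers_alert_3m = breached_for(active_workers, 10, minutes=3, now=now, above=False)
print(memory_alert, workers_alert_5m, workers_alert_3m)
```

With these samples the worker rule fires for a 3-minute window but not a 5-minute one, which shows the trade-off in picking N: a shorter window alerts sooner but is more sensitive to brief dips.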
9. Summary
| Layer | Purpose | Key points |
|---|---|---|
| JMX + Prometheus | Cluster health, trends, alerting | queued, memory, worker count |
| Web UI | Deep analysis of a single query | peak memory, stages, skew, spill |
| system.runtime | Real-time state, automation | queries, nodes, kill_query |
| Event listeners | Auditing, long-term and cost analysis | Load query logs into Iceberg |
The core of Trino observability is first determining "cluster-wide or single query?", then narrowing down in order: JMX dashboard → Web UI → EXPLAIN ANALYZE. On top of that, if you accumulate query history with an event listener, you get not only after-the-fact auditing and cost analysis, but also the data to tune Resource Group policies. Operating on metrics rather than guesswork — that is what makes a Trino cluster stable.
This post is based on the Trino 440 series. If you need help building Trino monitoring dashboards or establishing a query auditing and cost analysis pipeline, feel free to reach out.
— Data Dynamics Engineering Team