Trino Observability — JMX Metrics, the Web UI, and Diagnosing Slow Queries

What should you monitor to run production Trino reliably? This post covers JMX metrics and Prometheus scraping, query analysis with the Web UI, using system.runtime tables, query auditing and history capture with event listeners, and a step-by-step procedure for diagnosing slow queries.

Data Dynamics · June 5, 2026 · 8 min read

When a Trino cluster slows down or queries start failing, you can only operate it effectively if you can answer "why?" Without observability, you end up chasing user complaints and responding on guesswork. Fortunately, Trino ships with a rich set of observability tools out of the box — JMX metrics, the Web UI, system tables, and event listeners.

This post covers what to monitor, how to collect and visualize it, and the order in which to dig in when you hit a slow query.

1. The Four Layers of Observability

(1) JMX metrics      → Time series and alerting via Prometheus/Grafana (cluster health)
(2) Web UI           → Real-time query and stage analysis (single-query diagnosis)
(3) system tables    → Current state and history via SQL (automation and reporting)
(4) Event listeners  → Capture query start/completion events (auditing, long-term analysis)

| Layer | Question it answers | Tool |
| --- | --- | --- |
| JMX | Is the cluster healthy? What are the trends? | Prometheus, Grafana |
| Web UI | Why is this query slow? | Coordinator Web UI |
| system tables | What is running right now? | system.runtime.* |
| Event listeners | Who ran what queries yesterday? | Event listener → external storage |

2. JMX Metrics and Prometheus

Trino exposes its internal state via JMX and also provides a jmx catalog that lets you query JMX MBeans with SQL. In production, the typical pipeline is to scrape with Prometheus, visualize with Grafana, and alert with Alertmanager.

The coordinator can expose Prometheus-format metrics at its /metrics endpoint; alternatively, you can attach a JMX exporter to the JVM.

# Example Prometheus scrape config
scrape_configs:
  - job_name: trino
    metrics_path: /metrics
    static_configs:
      - targets: ['trino-coordinator:8080']
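
Before wiring up Grafana, it can help to sanity-check what a scrape actually returns. The following is a minimal sketch in Python (standard library only) of a parser for the Prometheus text exposition format. The metric names in the sample payload are made up for illustration; real names depend on your Trino version and exporter configuration.

```python
import re

# Matches 'name{labels} value' or 'name value'; labels are kept as a raw string.
# Simplification: label values containing '}' are not handled.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

def parse_metrics(text):
    """Return {metric name + raw labels: float value} from Prometheus text format."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):   # skip blanks, HELP and TYPE lines
            continue
        m = _SAMPLE.match(line)
        if m:
            name, labels, value = m.groups()
            out[name + (labels or '')] = float(value)
    return out

# Hypothetical payload; these are not actual Trino metric names.
sample = """# TYPE example_queued_queries gauge
example_queued_queries 12
example_pool_free_bytes{pool="general"} 2.5e9
"""
metrics = parse_metrics(sample)
```

In practice you would feed this the body of an HTTP GET against the scrape target and check that the metrics you plan to alert on are present.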

Key Metrics You Must Watch

| Metric area | What to look at | Alert signal |
| --- | --- | --- |
| Query counts | running / queued / blocked | queued keeps growing → resource shortage |
| Query outcomes | completed / failed ratio | failure rate spikes |
| Cluster memory | usage / reservation | approaching the limit → OOM risk |
| Worker node count | active workers | dropping → discovery or node failure |
| CPU | worker CPU utilization | sustained saturation → add capacity / scale |
| GC | GC pause time and frequency | getting longer → heap pressure |
| Spooling (FTE) | exchange I/O | spiking → excessive retries |

We recommend putting queued query count, cluster memory, and worker node count at the top of your dashboard — most incidents show up in these three metrics first.
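
"Queued keeps growing" is a statement about trend, not about a single reading. One way to make it concrete is a least-squares slope over recent samples. This is a rough sketch; the function names, the one-query-per-minute threshold, and the sample values are all assumptions for illustration.

```python
def slope(samples):
    """Least-squares slope of (t_seconds, value) pairs, in value units per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

def queue_is_growing(samples, min_per_minute=1.0):
    """True when the queued count rises at least min_per_minute (assumed threshold)."""
    return slope(samples) * 60 >= min_per_minute

# Queued query count sampled every 60 s (made-up values).
growing = [(0, 2), (60, 5), (120, 9), (180, 14)]
steady = [(0, 4), (60, 3), (120, 5), (180, 4)]
```

With these samples, `queue_is_growing(growing)` is true (about four queries per minute) while `steady` stays below the threshold. In a Prometheus setup the same idea is usually expressed in the alert rule itself rather than in application code.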

Ad-hoc Queries with the jmx Catalog

-- Cluster memory pool state via SQL
SELECT node, freebytes, maxbytes
FROM jmx.current."trino.memory:name=general,type=memorypool";

3. The Web UI — the Core Tool for Single-Query Diagnosis

The coordinator Web UI (typically port 8080 for HTTP or 8443 for HTTPS) provides a real-time query list and detailed analysis for each query. It is by far the most powerful tool when digging into a single slow query.

What to look at in Query Detail:

| Item | Meaning | Diagnosis |
| --- | --- | --- |
| Peak Memory | Maximum memory used by the query | Near the limit → top tuning priority |
| Per-stage time | Which stage takes the longest | Identify the bottleneck stage |
| Input/Output rows | Where data explodes | Join blowup |
| Spilled Data | Whether and how much spill occurred | Sign of memory pressure |
| Splits (scheduling) | Split distribution across workers | Data skew |

Diagnosing data skew: if a single worker's split count and processing time are conspicuously high, data is piling up on a specific key. Suspect the join key distribution, and revisit your partition/bucket design if needed.
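
"Conspicuously high" can be turned into a number by comparing the busiest worker against the median worker. The sketch below assumes you have copied per-worker split processing times out of the Web UI by hand; the worker names, values, and the 2x cutoff are illustrative assumptions.

```python
from statistics import median

def skew_ratio(per_worker):
    """Ratio of the busiest worker to the median worker; 1.0 means perfectly even."""
    values = list(per_worker.values())
    mid = median(values)
    return max(values) / mid if mid else float('inf')

def skewed_workers(per_worker, factor=2.0):
    """Workers whose value exceeds factor x median (factor is an assumed cutoff)."""
    mid = median(per_worker.values())
    return sorted(w for w, v in per_worker.items() if v > factor * mid)

# Split processing time per worker in seconds (made-up numbers).
times = {'worker-1': 40, 'worker-2': 38, 'worker-3': 42, 'worker-4': 310}
```

Here `skewed_workers(times)` singles out `worker-4`, whose time is several times the median. The median is used instead of the mean so that the outlier does not inflate its own baseline.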

4. system Tables — Querying Cluster State with SQL

The system.runtime schema lets you query cluster state with SQL, which makes it useful for automation and recurring reports.

-- Queries currently running/queued (oldest first)
-- system.runtime.queries exposes created/started timestamps, so elapsed time
-- is derived from created; check per-query memory in the Web UI.
SELECT query_id, state, user, resource_group_id,
       date_diff('second', created, now()) AS elapsed_s, query
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
ORDER BY created;

-- Nodes currently registered with the cluster (coordinator = false → worker)
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;

-- Find long-running queries (e.g. over 10 minutes)
SELECT query_id, user, date_diff('second', created, now()) AS elapsed_s, query
FROM system.runtime.queries
WHERE state = 'RUNNING' AND created < now() - INTERVAL '10' MINUTE
ORDER BY created;

-- Forcibly terminate a runaway query
CALL system.runtime.kill_query(query_id => '20260605_120000_00001_abcde',
                               message => 'admin killed: runaway scan');
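
If you automate this, it is safer to generate the kill statements and review them than to fire them blindly. The following sketch builds `CALL system.runtime.kill_query(...)` statements from a list of (query ID, elapsed seconds) pairs. How you obtain those pairs is up to you; the helper name, the one-hour limit, and the sample IDs are assumptions.

```python
def kill_statements(queries, max_seconds, reason):
    """Build kill_query CALL statements for queries running longer than max_seconds.

    queries: iterable of (query_id, elapsed_seconds) pairs.
    """
    def quote(s):
        return "'" + s.replace("'", "''") + "'"   # escape quotes for a SQL string literal

    return [
        f"CALL system.runtime.kill_query(query_id => {quote(qid)}, message => {quote(reason)})"
        for qid, elapsed in queries
        if elapsed > max_seconds
    ]

# Made-up query IDs and elapsed times.
running = [('20260605_120000_00001_abcde', 4200), ('20260605_121500_00002_fghij', 95)]
stmts = kill_statements(running, max_seconds=3600, reason="admin killed: over 1h")
```

Only the first query exceeds the limit, so `stmts` contains a single statement. Printing the list for an operator to confirm before execution keeps a human in the loop.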

5. Event Listeners — Query Auditing and Long-Term Analysis

The Web UI and system tables are ephemeral: query history lives in coordinator memory, so it is lost when the coordinator restarts and is trimmed by history retention limits. For long-term analysis and auditing, use an event listener to ship query start/completion events to external storage.

# etc/event-listener.properties
event-listener.name=...      # e.g. HTTP, Kafka, or a custom plugin

Events carry the query text, user, source, execution time, scanned bytes, memory, success/failure, error codes, and more. If you receive them via Kafka/HTTP and load them into an Iceberg table, you can run analyses like the following — using Trino itself.

-- Top 20 most expensive (highest scan volume) queries yesterday
SELECT user, scanned_bytes / 1e9 AS scanned_gb, elapsed_ms, query
FROM iceberg.audit.query_log
WHERE event_date = current_date - INTERVAL '1' DAY
ORDER BY scanned_bytes DESC
LIMIT 20;
 
-- Failure rate by user over the last 7 days
SELECT user, count(*) AS total,
       count_if(state = 'FAILED') AS failed,
       count_if(state = 'FAILED') * 1.0 / count(*) AS fail_rate
FROM iceberg.audit.query_log
WHERE event_date >= current_date - INTERVAL '7' DAY
GROUP BY user
ORDER BY fail_rate DESC;

This query log becomes the foundation for security auditing, cost analysis (who consumes the most resources), and data-driven Resource Group policy tuning.
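
Getting events into such a table means flattening each event payload into one row. The sketch below shows the shape of that step. The output columns mirror the ones used in the queries above, but the input keys (`query_id`, `create_epoch_ms`, `end_epoch_ms`, and so on) are purely hypothetical; map them from whatever schema your event listener actually emits.

```python
from datetime import datetime, timezone

def to_audit_row(event):
    """Flatten one query-completed event (a dict) into a row for the audit table.

    The input keys used here are illustrative, not the listener's actual schema.
    """
    end = datetime.fromtimestamp(event['end_epoch_ms'] / 1000, tz=timezone.utc)
    return {
        'query_id': event['query_id'],
        'user': event['user'],
        'source': event.get('source'),
        'query': event['query'],
        'state': 'FAILED' if event.get('error_code') else 'FINISHED',
        'scanned_bytes': int(event.get('scanned_bytes', 0)),
        'elapsed_ms': event['end_epoch_ms'] - event['create_epoch_ms'],
        'event_date': end.date().isoformat(),   # partition key, computed in UTC
    }

# Made-up event for illustration.
row = to_audit_row({
    'query_id': '20260605_120000_00001_abcde',
    'user': 'alice',
    'query': 'SELECT 1',
    'create_epoch_ms': 1780660800000,
    'end_epoch_ms': 1780660805000,
    'scanned_bytes': 1024,
    'error_code': 'EXCEEDED_TIME_LIMIT',
})
```

Deriving `event_date` in UTC keeps partitions stable regardless of where the loader runs; adjust if your reporting is in local time.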

6. A Procedure for Diagnosing Slow Queries

When you hit a symptom, narrow it down step by step.

1. Is it the whole cluster, or just this query?  → JMX dashboard (queued / memory / worker count)
        │ Cluster-wide → resource shortage or node failure (scale out / check discovery)
        │ Single query ↓
2. Which stage is the bottleneck?                → Web UI Query Detail (per-stage time)

3. Do estimated vs actual row counts diverge?    → EXPLAIN ANALYZE (stats problem → ANALYZE)

4. Is it data skew?                              → Web UI splits (variance across workers)

5. Is spill kicking in?                          → Web UI Spilled Data (memory pressure)

6. Did pushdown/pruning break?                   → EXPLAIN (function wrapping, statistics)

(For EXPLAIN and statistics in depth, see the separate post "A Deep Dive into the Trino Cost-Based Optimizer"; for memory and spill, see "Trino Memory Management and Resource Groups".)
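
The procedure above is essentially an ordered checklist, which can be written down as a small decision function for runbooks or chat-ops tooling. This is only a sketch: the signal names are hypothetical booleans that an operator (or another script) fills in, and the advice strings paraphrase the steps.

```python
def next_step(signals):
    """Return the first matching diagnosis, following the order of the procedure.

    signals is a dict of booleans; the key names are hypothetical.
    """
    checks = [
        ('cluster_wide', 'Cluster-wide: check capacity, worker count, discovery'),
        ('estimates_diverge', 'Stale statistics: run ANALYZE on the tables involved'),
        ('skewed_splits', 'Data skew: review join keys and partition/bucket design'),
        ('spilling', 'Memory pressure: review memory limits and query shape'),
        ('pushdown_missing', 'Pushdown/pruning lost: check predicates in EXPLAIN'),
    ]
    for key, advice in checks:
        if signals.get(key):
            return advice
    return 'No obvious cause: inspect the slowest stage in the Web UI'

# Example: a single slow query whose splits are uneven and which also spills.
advice = next_step({'cluster_wide': False, 'skewed_splits': True, 'spilling': True})
```

Because the checks run in order, skew is reported before spill in this example, matching the order of the procedure. Identifying the bottleneck stage (step 2) stays a manual step and is the fallback when no signal is set.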

7. Quick Reference: Symptom → Metric → Remedy

| Symptom | Metric to check first | Remedy |
| --- | --- | --- |
| Queries piling up in the queue | queued queries, concurrency limits | Adjust Resource Groups, add workers |
| Whole cluster slow | worker CPU, memory, node count | Scale out, recover failed nodes |
| Only a specific query slow | Web UI stage times | Tune the bottleneck stage |
| Intermittent OOM | cluster memory, peak memory | Spill, memory limits, query optimization |
| Worker count fluctuating | active workers, GC | Check discovery, heap, node pool |
| One query hogging resources | system.runtime.queries | kill_query + Resource Groups |
| Failure rate spiking | failed ratio, error codes | Analyze error patterns via the event log |

8. Minimum Alerts to Set Up

At minimum, set up the following in Grafana/Alertmanager:

  • Active worker count below the expected value for N minutes
  • Cluster memory utilization > 85% for N minutes
  • Queued query count > threshold for N minutes
  • Query failure rate > threshold
  • Coordinator health check (/v1/info) failing
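
The "for N minutes" qualifier matters: it keeps a brief spike from paging anyone. The sketch below shows that logic in isolation, in the spirit of a sustained-threshold alert rule. The function name, the 85% threshold, and the sample values are assumptions for illustration; in a real setup this logic lives in the alerting system rather than in your own code.

```python
def breached_for(samples, threshold, for_seconds):
    """True if every sample in the trailing for_seconds window exceeds threshold.

    samples: list of (t_seconds, value) sorted by time; the window ends at the
    newest sample. History must cover the whole window, so a short blip or a
    freshly started series does not fire.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < for_seconds:
        return False                       # not enough history yet
    window = [v for t, v in samples if t >= latest - for_seconds]
    return all(v > threshold for v in window)

# Cluster memory utilization (0..1) sampled each minute; values are made up.
sustained = [(0, 0.70), (60, 0.88), (120, 0.90), (180, 0.91), (240, 0.93)]
blip = [(0, 0.70), (60, 0.72), (120, 0.95), (180, 0.71), (240, 0.90)]
```

With a 0.85 threshold and a 180-second window, `sustained` fires and `blip` does not, even though `blip` contains the single highest reading.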

9. Summary

| Layer | Purpose | Key points |
| --- | --- | --- |
| JMX + Prometheus | Cluster health, trends, alerting | queued, memory, worker count |
| Web UI | Deep analysis of a single query | peak memory, stages, skew, spill |
| system.runtime | Real-time state, automation | queries, nodes, kill_query |
| Event listeners | Auditing, long-term and cost analysis | Load query logs into Iceberg |

The core of Trino observability is first determining "cluster-wide or single query?", then narrowing down in order: JMX dashboard → Web UI → EXPLAIN ANALYZE. On top of that, if you accumulate query history with an event listener, you get not only after-the-fact auditing and cost analysis, but also the data to tune Resource Group policies. Operating on metrics rather than guesswork — that is what makes a Trino cluster stable.


This post is based on the Trino 440 series. If you need help building Trino monitoring dashboards or establishing a query auditing and cost analysis pipeline, feel free to reach out.

— Data Dynamics Engineering Team