
Trino Observability — JMX Metrics, the Web UI, and Diagnosing Slow Queries

What should you monitor to run production Trino reliably? This post covers JMX metrics and Prometheus scraping, query analysis with the Web UI, using system.runtime tables, query auditing and history capture with event listeners, and a step-by-step procedure for diagnosing slow queries.

Data Dynamics · June 5, 2026 · 8 min read

When a Trino cluster slows down or queries start failing, you can only operate it effectively if you can answer "why?" Without observability, you end up chasing user complaints and responding on guesswork. Fortunately, Trino ships with a rich set of observability tools out of the box — JMX metrics, the Web UI, system tables, and event listeners.

This post covers what to monitor, how to collect and visualize it, and the order in which to dig in when you hit a slow query.

1. The Four Layers of Observability


Layer	Question it answers	Tool
JMX	Is the cluster healthy? What are the trends?	Prometheus, Grafana
Web UI	Why is this query slow?	Coordinator Web UI
system tables	What is running right now?	`system.runtime.*`
Event listeners	Who ran what queries yesterday?	Event listener → external storage

2. JMX Metrics and Prometheus

Trino exposes its internal state via JMX, and also provides a jmx catalog that lets you query JMX with SQL. In production, the typical setup is scrape with Prometheus → visualize with Grafana → alert with Alertmanager.

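Trino nodes can export Prometheus-format (OpenMetrics) metrics via the /metrics endpoint; alternatively, you can attach a JMX exporter to the JVM. The example below scrapes only the coordinator, so add the workers as targets as well if you want per-node CPU, memory, and GC metrics.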

# Example Prometheus scrape config
scrape_configs:
  - job_name: trino
    metrics_path: /metrics
    static_configs:
      - targets: ['trino-coordinator:8080']

Key Metrics You Must Watch

Metric area	What to look at	Alert signal
Query counts	running / queued / blocked	queued keeps growing → resource shortage
Query outcomes	completed / failed ratio	failure rate spikes
Cluster memory	usage / reservation	approaching the limit → OOM risk
Worker node count	active workers	dropping → discovery or node failure
CPU	worker CPU utilization	sustained saturation → add capacity / scale
GC	GC pause time and frequency	getting longer → heap pressure
Spooling (FTE)	exchange I/O	spiking → excessive retries

We recommend putting queued query count, cluster memory, and worker node count at the top of your dashboard — most incidents show up in these three metrics first.
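
To make the "top three" idea concrete, here is a minimal Python sketch that parses a scrape in Prometheus text format and pulls out only those three values. The metric names in `TOP_PANEL` and the sample text are hypothetical placeholders, not names Trino is guaranteed to emit; substitute whatever your own /metrics output contains. The parser also assumes samples carry no trailing timestamp.

```python
from typing import Dict, Optional

# Hypothetical metric names (assumption): replace with the names your scrape returns.
TOP_PANEL = (
    "trino_queued_queries",
    "trino_cluster_memory_reserved_bytes",
    "trino_active_workers",
)


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition lines into {name_with_labels: value}."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled sample: keep everything up to the closing brace as the key.
            end = line.rindex("}") + 1
            name, rest = line[:end], line[end:].split()
        else:
            fields = line.split()
            name, rest = fields[0], fields[1:]
        if rest:
            samples[name] = float(rest[0])
    return samples


def top_panel(samples: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Return the three dashboard-top values, or None where a metric is missing."""
    return {name: samples.get(name) for name in TOP_PANEL}


# Illustrative scrape output, not real Trino output.
SAMPLE = """\
# TYPE trino_queued_queries gauge
trino_queued_queries 12
trino_active_workers 8
trino_cluster_memory_reserved_bytes 6.4e10
jvm_gc_pause_seconds{gc="G1 Young Generation"} 0.25
"""

if __name__ == "__main__":
    print(top_panel(parse_metrics(SAMPLE)))
```

In practice Grafana does this selection for you; the sketch is mainly useful for a quick command-line check of what a node actually exposes before you build panels.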

Ad-hoc Queries with the jmx Catalog

-- Cluster memory pool state via SQL
SELECT node, freebytes, maxbytes
FROM jmx.current."trino.memory:name=general,type=memorypool";
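
The rows returned by a query like this can be turned into a per-node utilization report. The following is a rough sketch that assumes you have already fetched `(node, freebytes, maxbytes)` tuples with whatever client you use; the 0.85 warning level simply mirrors the 85% alert threshold suggested later in this post and is not a Trino default.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, int, int]  # (node, freebytes, maxbytes) as returned by the query


def memory_utilization(rows: Iterable[Row], warn_at: float = 0.85) -> List[Tuple[str, float, bool]]:
    """Return (node, utilization, over_threshold) tuples, most loaded node first."""
    report = []
    for node, free_bytes, max_bytes in rows:
        if max_bytes <= 0:
            continue  # skip nodes that report no pool size
        used = 1.0 - free_bytes / max_bytes
        report.append((node, round(used, 3), used >= warn_at))
    report.sort(key=lambda entry: entry[1], reverse=True)
    return report


if __name__ == "__main__":
    # Illustrative values only.
    rows = [("worker-1", 20, 100), ("worker-2", 10, 100)]
    for node, used, hot in memory_utilization(rows):
        print(f"{node}: {used:.0%}{'  <-- above threshold' if hot else ''}")
```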

3. The Web UI — the Core Tool for Single-Query Diagnosis

The coordinator Web UI (port 8080/8443 by default) provides a real-time query list and detailed analysis for each query. It is by far the most powerful tool when digging into a single slow query.

What to look at in Query Detail:

Item	Meaning	Diagnosis
Peak Memory	Maximum memory used by the query	Near the limit → top tuning priority
Per-stage time	Which stage takes the longest	Identify the bottleneck stage
Input/Output rows	Where data explodes	Join blowup
Spilled Data	Whether and how much spill occurred	Sign of memory pressure
Splits (scheduling)	Split distribution across workers	Data skew

Diagnosing data skew: if a single worker's split count and processing time are conspicuously high, data is piling up on a specific key. Suspect the join key distribution, and revisit your partition/bucket design if needed.
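
The "conspicuously high" check can be sketched as a comparison against the median. The snippet below assumes you have copied per-worker split counts out of the query detail page into a dictionary; the factor of 2 is an arbitrary starting point, not a Trino-defined threshold.

```python
from statistics import median
from typing import Dict, List


def skewed_workers(split_counts: Dict[str, int], factor: float = 2.0) -> List[str]:
    """Return workers whose split count exceeds factor times the median."""
    if not split_counts:
        return []
    typical = median(split_counts.values())
    return sorted(worker for worker, count in split_counts.items() if count > factor * typical)


if __name__ == "__main__":
    # Illustrative numbers: one worker handles roughly four times the typical load.
    counts = {"worker-1": 100, "worker-2": 110, "worker-3": 95, "worker-4": 400}
    print(skewed_workers(counts))
```

The median is used instead of the mean so that the overloaded worker itself does not drag the baseline upward. The same function works for per-worker processing time if you pass times instead of counts.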

4. system Tables — Querying Cluster State with SQL

The system.runtime schema lets you query cluster state with SQL, which makes it useful for automation and recurring reports.

-- Queries currently running or queued (oldest first).
-- system.runtime.queries exposes timing columns but no per-query memory column;
-- check memory in the Web UI. Run DESCRIBE system.runtime.queries to confirm columns.
SELECT query_id, state, user, source, resource_group_id, created, query
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
ORDER BY created;

-- Cluster nodes and their state (coordinator = false marks workers)
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;

-- Find long-running queries (e.g. created more than 10 minutes ago)
SELECT query_id, user, date_diff('second', created, now()) AS elapsed_s, query
FROM system.runtime.queries
WHERE state = 'RUNNING'
  AND created < now() - INTERVAL '10' MINUTE
ORDER BY created;

-- Forcibly terminate a runaway query
CALL system.runtime.kill_query(query_id => '20260605_120000_00001_abcde',
                               message => 'admin killed: runaway scan');
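
For automation, the two steps above (find long-running queries, then kill them) can be combined. This is a cautious sketch that only generates the `CALL` statements for a human or a scheduler to review; it does not connect to Trino. The query ID pattern is an assumption based on the ID shown in the example above, and the function name and default message are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Assumed ID shape, modelled on '20260605_120000_00001_abcde'. Anything else is skipped,
# which also prevents unexpected text from being spliced into the SQL.
QUERY_ID = re.compile(r"\d{8}_\d{6}_\d{5}_[a-z0-9]{5}")


def kill_statements(
    running: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(minutes=10),
    message: str = "admin killed: exceeded time limit",
) -> List[str]:
    """Build kill_query calls for queries created more than max_age ago."""
    safe_message = message.replace("'", "''")  # escape single quotes for SQL
    statements = []
    for query_id, created in running:
        if not QUERY_ID.fullmatch(query_id):
            continue
        if now - created > max_age:
            statements.append(
                "CALL system.runtime.kill_query("
                f"query_id => '{query_id}', message => '{safe_message}')"
            )
    return statements


if __name__ == "__main__":
    now = datetime(2026, 6, 5, 12, 30, tzinfo=timezone.utc)
    rows = [("20260605_120000_00001_abcde", now - timedelta(minutes=30))]
    for statement in kill_statements(rows, now):
        print(statement)
```

Generating statements instead of executing them keeps a person in the loop, which is usually preferable for anything that terminates user work.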

5. Event Listeners — Query Auditing and Long-Term Analysis

The Web UI and system tables are ephemeral: query history is lost when the coordinator restarts and is trimmed by history retention limits. For long-term analysis and auditing, use an event listener to ship query start/completion events to external storage.

# etc/event-listener.properties
event-listener.name=...      # e.g. HTTP, Kafka, or a custom plugin

Events carry the query text, user, source, execution time, scanned bytes, memory, success/failure, error codes, and more. If you receive them via Kafka/HTTP and load them into an Iceberg table, you can run analyses like the following — using Trino itself.

-- Top 20 most expensive (highest scan volume) queries yesterday
SELECT user, scanned_bytes / 1e9 AS scanned_gb, elapsed_ms, query
FROM iceberg.audit.query_log
WHERE event_date = current_date - INTERVAL '1' DAY
ORDER BY scanned_bytes DESC
LIMIT 20;
 
-- Failure rate by user and source over the last 7 days
-- (assumes the log table also stores the event's source)
SELECT user, source, count(*) AS total,
       count_if(state = 'FAILED') AS failed,
       count_if(state = 'FAILED') * 1.0 / count(*) AS fail_rate
FROM iceberg.audit.query_log
WHERE event_date >= current_date - INTERVAL '7' DAY
GROUP BY user, source
ORDER BY fail_rate DESC;

This query log becomes the foundation for security auditing, cost analysis (who consumes the most resources), and data-driven Resource Group policy tuning.
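
Before events can be queried this way, each one has to be flattened into a row. The sketch below shows one way a consumer might do that in Python. The input keys (`queryId`, `scannedBytes`, `elapsedMillis`, and so on) are illustrative assumptions, not the documented payload of any particular listener, so inspect a real event from your setup before relying on them. The output keys follow the columns used in the queries above, plus a few extras.

```python
from datetime import datetime, timezone


def to_log_row(event: dict) -> dict:
    """Flatten one query-completed event (hypothetical shape) into a query_log row."""
    # Normalize the end time to UTC so event_date partitions are consistent.
    end = datetime.fromisoformat(event["endTime"]).astimezone(timezone.utc)
    return {
        "query_id": event["queryId"],
        "user": event["user"],
        "source": event.get("source"),
        "state": event["state"],
        "query": event["query"],
        "scanned_bytes": int(event.get("scannedBytes", 0)),
        "elapsed_ms": int(event.get("elapsedMillis", 0)),
        "error_code": event.get("errorCode"),
        "event_date": end.date().isoformat(),
    }


if __name__ == "__main__":
    # Illustrative event; every field name here is an assumption.
    sample = {
        "queryId": "20260605_120000_00001_abcde",
        "user": "alice",
        "source": "trino-cli",
        "state": "FINISHED",
        "query": "SELECT 1",
        "scannedBytes": 1024,
        "elapsedMillis": 350,
        "endTime": "2026-06-05T03:30:00+09:00",
    }
    print(to_log_row(sample))
```

Deriving `event_date` in UTC is a deliberate choice here: events that finish shortly after local midnight would otherwise land in different partitions depending on where the consumer runs.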

6. A Procedure for Diagnosing Slow Queries

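When you hit a symptom, narrow it down step by step: first decide whether the slowdown is cluster-wide or limited to a single query, then work through the JMX dashboard, the Web UI, and finally EXPLAIN ANALYZE.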


(For EXPLAIN and statistics in depth, see the separate post "A Deep Dive into the Trino Cost-Based Optimizer"; for memory and spill, see "Trino Memory Management and Resource Groups".)
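
The first decision in the procedure, cluster-wide or single query, can be sketched as a small triage function. The snapshot fields and thresholds below are hypothetical values for illustration; the checklist text is taken from the layers and remedies described in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    queued_queries: int
    active_workers: int
    expected_workers: int
    memory_utilization: float  # fraction of cluster memory in use, 0.0 to 1.0


def is_cluster_wide(snapshot: ClusterSnapshot, max_queued: int = 20, max_memory: float = 0.85) -> bool:
    """Treat the problem as cluster-wide if any of the three top-level signals is off."""
    return (
        snapshot.queued_queries > max_queued
        or snapshot.active_workers < snapshot.expected_workers
        or snapshot.memory_utilization > max_memory
    )


def diagnosis_plan(snapshot: ClusterSnapshot) -> List[str]:
    """Return an ordered checklist for the next diagnosis steps."""
    if is_cluster_wide(snapshot):
        return [
            "JMX dashboard: queued queries, cluster memory, worker count",
            "Remedy: adjust Resource Groups, scale out, or recover failed nodes",
        ]
    return [
        "Web UI: per-stage time, peak memory, spill, split distribution",
        "EXPLAIN ANALYZE: confirm where the bottleneck stage spends its time",
    ]


if __name__ == "__main__":
    healthy = ClusterSnapshot(queued_queries=2, active_workers=10,
                              expected_workers=10, memory_utilization=0.4)
    for step in diagnosis_plan(healthy):
        print(step)
```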

7. Quick Reference: Symptom → Metric → Remedy

Symptom	Metric to check first	Remedy
Queries piling up in the queue	queued queries, concurrency limits	Adjust Resource Groups, add workers
Whole cluster slow	worker CPU, memory, node count	Scale out, recover failed nodes
Only a specific query slow	Web UI stage times	Tune the bottleneck stage
Intermittent OOM	cluster memory, peak memory	Spill, memory limits, query optimization
Worker count fluctuating	active workers, GC	Check discovery, heap, node pool
One query hogging resources	runtime.queries	kill_query + Resource Groups
Failure rate spiking	failed ratio, error codes	Analyze error patterns via the event log

8. Recommended Alerting Rules

At minimum, set up the following in Grafana/Alertmanager:

Active worker count below the expected value for N minutes
Cluster memory utilization > 85% for N minutes
Queued query count > threshold for N minutes
Query failure rate > threshold
Coordinator health check (/v1/info) failing
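
The "for N minutes" qualifier matters: it keeps a single spike from paging anyone. The following Python sketch shows the idea in simplified form. It is not how Alertmanager or Prometheus evaluates rules internally, and the `min_samples` guard is an assumption added here so that a window with too little data does not fire.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value)


def sustained_breach(
    samples: List[Sample],
    threshold: float,
    window: timedelta,
    now: datetime,
    min_samples: int = 3,
) -> bool:
    """True if every sample inside the window exceeds the threshold."""
    recent = [value for ts, value in samples if now - window <= ts <= now]
    if len(recent) < min_samples:
        return False  # too little data to call it sustained
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    now = datetime(2026, 6, 5, 12, 0)
    # Illustrative series: memory utilization sampled once a minute.
    series = [(now - timedelta(minutes=m), 0.9) for m in range(5)]
    print(sustained_breach(series, threshold=0.85, window=timedelta(minutes=5), now=now))
```

The same function covers the queued-query and failure-rate rules by passing a different series and threshold. For the worker-count rule, which fires when the value is too low, negate both the values and the threshold or add a comparison parameter.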

9. Summary

Layer	Purpose	Key points
JMX + Prometheus	Cluster health, trends, alerting	queued, memory, worker count
Web UI	Deep analysis of a single query	peak memory, stages, skew, spill
system.runtime	Real-time state, automation	`queries`, `nodes`, `kill_query`
Event listeners	Auditing, long-term and cost analysis	Load query logs into Iceberg

The core of Trino observability is first determining "cluster-wide or single query?", then narrowing down in order: JMX dashboard → Web UI → EXPLAIN ANALYZE. On top of that, if you accumulate query history with an event listener, you get not only after-the-fact auditing and cost analysis, but also the data to tune Resource Group policies. Operating on metrics rather than guesswork — that is what makes a Trino cluster stable.

This post is based on the Trino 440 series. If you need help building Trino monitoring dashboards or establishing a query auditing and cost analysis pipeline, feel free to reach out.

— Data Dynamics Engineering Team