Trino Memory Management and Resource Groups
A practical walkthrough of Trino's memory model (per-node, cluster, heap headroom), spill-to-disk, and how to isolate workloads with Resource Groups so a single query can never take down the whole cluster — with real-world configuration examples.
The two most common failures in Trino operations are "one giant query eats all the memory and paralyzes the entire cluster" and "ad-hoc queries from one team hog the concurrency slots, pushing scheduled batch jobs behind." Both problems can be prevented structurally once you understand the memory model and Resource Groups.
This post explains how Trino distributes memory, what spill-to-disk saves you from, and how to isolate workloads with Resource Groups — with real configuration examples.
1. The Trino Memory Model — Three Boundaries
Trino memory is governed by three boundaries.
[ Worker node JVM heap ]
├── heap-headroom-per-node (reserved area Trino never touches: GC, buffers)
└── Query memory pool
└── query.max-memory-per-node (cap on what one query can use on this node)
[ Cluster-wide ]
└── query.max-memory (cap on what one query can use summed across all workers)| Setting | Scope | Meaning |
|---|---|---|
query.max-memory-per-node | One worker | Maximum memory a single query can use on that node |
query.max-memory | Cluster | Maximum memory a single query can use summed across all workers |
memory.heap-headroom-per-node | One worker | Heap that Trino leaves unallocated to queries (for GC and system use) |
The relationship (approximately):
query.max-memory-per-node < (JVM heap - heap-headroom-per-node)
query.max-memory ≈ query.max-memory-per-node × number of workers (or less)For example, with a 24GB worker heap and 5GB of headroom, the query pool is about 19GB. Setting query.max-memory-per-node to 12GB means one query can use up to 12GB on a node, leaving the rest for other queries to share.
# etc/config.properties (workers)
query.max-memory-per-node=12GB
memory.heap-headroom-per-node=5GB
# Coordinator
query.max-memory=240GBWhen a query exceeds these caps, only that query fails with EXCEEDED_LOCAL_MEMORY_LIMIT or EXCEEDED_GLOBAL_MEMORY_LIMIT. The caps are what prevent a single query from destabilizing the entire cluster.
2. Spill-to-Disk — Slow Down Instead of OOM
Instead of unconditionally killing queries that hit the memory cap, you can let them spill intermediate data to disk and run to completion. Spill works for joins, aggregations, sorts, and window operations.
# etc/config.properties
spill-enabled=true
spiller-spill-path=/data1/trino-spill,/data2/trino-spill # spreading across multiple disks is recommended
max-spill-per-node=200GB
query-max-spill-per-node=100GBThe trade-off is clear-cut.
| spill off | spill on | |
|---|---|---|
| On memory overrun | Query fails | Spills to disk and finishes |
| Performance | Fast (in-memory) | Slower (disk I/O) |
| Best for | Low-latency interactive | Large batch / ETL |
Spill is a safety net that lets queries "finish, even if slowly." Processing everything in memory is the ideal, but it beats losing the cluster to the occasional oversized query. Spread the spill paths across multiple fast local SSDs.
When even spill is not enough, the root fix is to improve the query itself — select only the columns you need instead of SELECT *, shrink the build side of large joins, pre-aggregate, and leverage partition pruning.
3. The Problem: Memory Caps Alone Are Not Enough
query.max-memory caps "one query." But real-world problems are usually about contention between many queries.
- If 30 analysts fire heavy queries at the same time, cluster memory is exhausted even though each query stays within its cap.
- Low-priority ad-hoc queries occupy the concurrency slots, pushing SLA-bound scheduled batches back in the queue.
- A storm of dashboard refreshes from one BI tool degrades responsiveness for everyone.
This is what Resource Groups solve. (If you come from Cloudera, this is the counterpart of Impala's Admission Control / resource pools.)
4. Resource Groups — Workload Isolation
Resource Groups classify incoming queries into groups and give each group concurrency, memory, and queue limits.
# etc/resource-groups.properties
resource-groups.configuration-manager=file
resource-groups.config-file=/etc/trino/resource-groups.json{
"rootGroups": [
{
"name": "global",
"softMemoryLimit": "80%",
"hardConcurrencyLimit": 100,
"maxQueued": 1000,
"schedulingPolicy": "weighted",
"subGroups": [
{
"name": "batch",
"softMemoryLimit": "60%",
"hardConcurrencyLimit": 20,
"maxQueued": 500,
"schedulingWeight": 3
},
{
"name": "adhoc",
"softMemoryLimit": "30%",
"hardConcurrencyLimit": 40,
"maxQueued": 200,
"schedulingWeight": 1
},
{
"name": "dashboard",
"softMemoryLimit": "20%",
"hardConcurrencyLimit": 50,
"maxQueued": 100,
"schedulingWeight": 2
}
]
}
],
"selectors": [
{ "user": "etl-svc", "group": "global.batch" },
{ "source": "superset", "group": "global.dashboard" },
{ "group": "global.adhoc" }
]
}Key Properties
| Property | Meaning |
|---|---|
hardConcurrencyLimit | Maximum number of queries running concurrently |
maxQueued | Maximum length of the wait queue; queries beyond it are rejected |
softMemoryLimit | Memory this group may use (%, or an absolute value). When exceeded, new queries wait |
schedulingPolicy | Policy for picking the next query from the queue |
schedulingWeight | Resource-sharing weight between groups under the weighted policy |
Selectors — Matching Queries to Groups
selectors assign a query to the first matching rule, evaluated top to bottom. You can classify by user, source (the source name sent by the client), clientTags, queryType, and more.
etl-svc user → global.batch (concurrency 20, memory 60%, high weight)
superset dashboards → global.dashboard (high concurrency, small queries)
everything else → global.adhoc (ad-hoc analysis, low weight)With this in place, no matter how many ad-hoc analyst queries (adhoc) pile up, they cannot encroach on the concurrency or memory of batch (batch). This is the key mechanism for protecting SLA-bound workloads.
5. Choosing a Scheduling Policy
| Policy | Behavior | Best for |
|---|---|---|
fair | Treats queries within a group fairly (FIFO-based) | A sensible default |
weighted | Distributes resources to subgroups by schedulingWeight ratio | Differentiated priority between groups |
weighted_fair | Compromise between weights and fairness | Balancing many groups |
query_priority | Prioritizes by each query's query_priority session value | Dynamic priority control |
The example above uses weighted, allocating resources at a batch:dashboard:adhoc ratio of 3:2:1.
6. Query-Level Controls — Extra Guardrails
Beyond Resource Groups, global and session settings help stop runaway queries.
# etc/config.properties — block abnormally heavy/slow queries
query.max-execution-time=2h
query.max-scan-physical-bytes=5TB
query.max-stage-count=150-- Per-session adjustment (for a specific query only)
SET SESSION query_max_memory_per_node = '20GB';
SET SESSION query_max_run_time = '30m';query.max-scan-physical-bytes is a handy guardrail that kills a query once it scans past a given byte threshold — useful when someone accidentally launches a giant full scan.
7. Diagnostics — What to Look at When Tuning
7.1 System Tables
-- Currently running/queued queries and their memory usage
SELECT query_id, state, resource_group_id,
query, total_memory_reservation
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
ORDER BY total_memory_reservation DESC;
-- Per-resource-group status
SELECT * FROM system.runtime.resource_group_pools;7.2 EXPLAIN ANALYZE
After actually running a query, inspect time, memory, and row counts per stage.
EXPLAIN ANALYZE
SELECT user_id, count(*)
FROM iceberg.analytics.events
WHERE event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
GROUP BY user_id;This is where bad join orders or a single stage monopolizing memory show up. If join quality is poor, refresh the statistics.
ANALYZE iceberg.analytics.events;7.3 Web UI
The Query Detail page in the coordinator Web UI lets you visually inspect per-stage memory, execution time, and whether spill occurred. Queries whose "Peak Memory" approaches the cap are your top tuning candidates.
8. Quick Symptom-to-Fix Table
| Symptom | Likely cause | Fix |
|---|---|---|
EXCEEDED_LOCAL_MEMORY_LIMIT | Query exceeded the cap on one node | Enable spill, raise query.max-memory-per-node, optimize the query |
EXCEEDED_GLOBAL_MEMORY_LIMIT | Cluster-wide cap exceeded | Review query.max-memory, add workers, split the query |
| Worker OOMKill / GC storms | Heap ≈ container limit, insufficient headroom | Raise headroom, ensure heap < limit |
| Batches starved by ad-hoc queries | No workload isolation | Separate batch/adhoc with Resource Groups |
| Dashboard storms degrade everything | Unbounded dashboard concurrency | Concurrency and queue limits on the dashboard group |
| Accidental giant full scan | No guardrails | Set query.max-scan-physical-bytes |
| Query runs but is slow | Bad join order, missing statistics | Diagnose with ANALYZE and EXPLAIN ANALYZE |
9. Tuning Checklist
- JVM heap < container/node memory (20-30% headroom)
-
query.max-memory-per-node< (heap − headroom) - Spill enabled + multiple paths on fast local disks
- batch/adhoc/dashboard isolated with Resource Groups
- Selectors map groups by user and source
- Global guardrails such as
query.max-scan-physical-bytes - Regular
ANALYZEon key tables - Review top peak-memory queries via the Web UI / system tables
10. Summary
| Layer | Tool | What it prevents |
|---|---|---|
| Single-query memory | query.max-memory(-per-node) | One query monopolizing the cluster |
| Completion guarantee | spill-to-disk | Failures caused by OOM |
| Workload contention | Resource Groups | Groups encroaching on each other, SLA violations |
| Runaway incidents | Global query limits | Accidental giant scans, unbounded execution |
| Execution quality | ANALYZE + EXPLAIN ANALYZE | Bad join orders, slow queries |
Stabilizing Trino comes down to two things. First, understand the memory model so heap, headroom, and query caps are set consistently, with spill as the safety net. Second, isolate workloads with Resource Groups so no single query or team can paralyze the whole cluster. With both in place, your Trino cluster will "slow down but never stop" even under heavy load.
This article is based on the Trino 440 series. If you need help with memory and concurrency tuning or workload isolation design for your Trino cluster, feel free to reach out.
— Data Dynamics Engineering Team