
Trino Memory Management and Resource Groups

A practical walkthrough of Trino's memory model (per-node, cluster, heap headroom), spill-to-disk, and how to isolate workloads with Resource Groups so a single query can never take down the whole cluster — with real-world configuration examples.

Data Dynamics | June 5, 2026 | 8 min read

The two most common failures in Trino operations are "one giant query eats all the memory and paralyzes the entire cluster" and "ad-hoc queries from one team hog the concurrency slots, pushing scheduled batch jobs back in the queue." Both problems can be prevented structurally once you understand the memory model and Resource Groups.

This post explains how Trino distributes memory, what spill-to-disk saves you from, and how to isolate workloads with Resource Groups — with real configuration examples.

1. The Trino Memory Model — Three Boundaries

Trino memory is governed by three boundaries.

[ Worker node JVM heap ]
   ├── heap-headroom-per-node   (reserved area Trino never touches: GC, buffers)
   └── Query memory pool
         └── query.max-memory-per-node   (cap on what one query can use on this node)
 
[ Cluster-wide ]
   └── query.max-memory   (cap on what one query can use summed across all workers)

Setting	Scope	Meaning
`query.max-memory-per-node`	One worker	Maximum memory a single query can use on that node
`query.max-memory`	Cluster	Maximum memory a single query can use summed across all workers
`memory.heap-headroom-per-node`	One worker	Heap that Trino leaves unallocated to queries (for GC and system use)

The relationship (approximately):

query.max-memory-per-node  <  (JVM heap - heap-headroom-per-node)
query.max-memory           ≈  query.max-memory-per-node × number of workers  (or less)

For example, with a 24GB worker heap and 5GB of headroom, the query pool is about 19GB. Setting query.max-memory-per-node to 12GB means one query can use up to 12GB on a node, leaving the rest for other queries to share.

# etc/config.properties (workers)
query.max-memory-per-node=12GB
memory.heap-headroom-per-node=5GB
 
# Coordinator
query.max-memory=240GB

When a query exceeds these caps, only that query fails with EXCEEDED_LOCAL_MEMORY_LIMIT or EXCEEDED_GLOBAL_MEMORY_LIMIT. The caps are what prevent a single query from destabilizing the entire cluster.
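The arithmetic behind these settings can be sanity-checked with a few lines of Python. This is a minimal sketch, not a Trino tool: the function name is hypothetical, and the 20-worker count is an assumption chosen so that 12GB per node lines up with the 240GB cluster cap in the example.

```python
def memory_budget(heap_gb, headroom_gb, per_node_cap_gb, workers):
    """Derive the query pool and the largest cluster-wide cap one query could reach."""
    pool_gb = heap_gb - headroom_gb
    # Mirrors the inequality above: the per-node cap must fit inside the pool.
    if per_node_cap_gb >= pool_gb:
        raise ValueError(
            f"query.max-memory-per-node ({per_node_cap_gb}GB) "
            f"does not fit in the query pool ({pool_gb}GB)"
        )
    return {
        "pool_gb": pool_gb,
        "left_for_other_queries_gb": pool_gb - per_node_cap_gb,
        "max_useful_cluster_cap_gb": per_node_cap_gb * workers,
    }


# Heap, headroom, and per-node cap come from the example; workers=20 is assumed.
budget = memory_budget(heap_gb=24, headroom_gb=5, per_node_cap_gb=12, workers=20)
print(budget)
```

Setting query.max-memory above `max_useful_cluster_cap_gb` has no practical effect, because a query would hit the per-node cap on every worker first.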

2. Spill-to-Disk — Slow Down Instead of OOM

Instead of unconditionally killing queries that hit the memory cap, you can let them spill intermediate data to disk and run to completion. Spill works for joins, aggregations, sorts, and window operations.

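# etc/config.properties
spill-enabled=true
# Spreading spill across multiple disks is recommended.
# Keep comments on their own lines: in a .properties file, text after the value becomes part of the value.
spiller-spill-path=/data1/trino-spill,/data2/trino-spill
max-spill-per-node=200GB
query-max-spill-per-node=100GB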

The trade-off is clear-cut.

Aspect	spill off	spill on
On memory overrun	Query fails	Spills to disk and finishes
Performance	Fast (in-memory)	Slower (disk I/O)
Best for	Low-latency interactive	Large batch / ETL

Spill is a safety net that lets queries "finish, even if slowly." Processing everything in memory is the ideal, but a slow finish beats a failed query when the occasional oversized job comes along. Spread the spill paths across multiple fast local SSDs.
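A small consistency check on the spill settings can catch obvious mistakes before a restart. The sketch below is illustrative only: the helper names are hypothetical, the parser handles just plain `key=value` lines, and the size conversion covers only GB and TB.

```python
SPILL_PROPERTIES = """
spill-enabled=true
spiller-spill-path=/data1/trino-spill,/data2/trino-spill
max-spill-per-node=200GB
query-max-spill-per-node=100GB
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and full-line # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def to_gb(size):
    """Convert a size such as '200GB' or '1TB' to gigabytes (GB and TB only)."""
    units = {"GB": 1, "TB": 1024}
    return float(size[:-2]) * units[size[-2:]]


def check_spill(props):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if props.get("spill-enabled") != "true":
        problems.append("spill is disabled")
    paths = [p for p in props.get("spiller-spill-path", "").split(",") if p]
    if len(paths) < 2:
        problems.append("fewer than two spill paths")
    if to_gb(props["query-max-spill-per-node"]) > to_gb(props["max-spill-per-node"]):
        problems.append("per-query spill cap exceeds the node-wide spill cap")
    return problems


problems = check_spill(parse_properties(SPILL_PROPERTIES))
print(problems or "spill settings look consistent")
```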

When even spill is not enough, the root fix is to improve the query itself — select only the columns you need instead of SELECT *, shrink the build side of large joins, pre-aggregate, and leverage partition pruning.

3. The Problem: Memory Caps Alone Are Not Enough

query.max-memory caps "one query." But real-world problems are usually about contention between many queries.

If 30 analysts fire heavy queries at the same time, cluster memory is exhausted even though each query stays within its cap.
Low-priority ad-hoc queries occupy the concurrency slots, pushing SLA-bound scheduled batches back in the queue.
A storm of dashboard refreshes from one BI tool degrades responsiveness for everyone.

This is what Resource Groups solve. (If you come from Cloudera, this is the counterpart of Impala's Admission Control / resource pools.)
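To put numbers on the first scenario, the sketch below computes the worst-case per-node demand if every concurrent query grew to its per-node cap. It reuses the 12GB cap and 19GB pool from the earlier example; the function name is hypothetical, and real queries rarely all peak at once, so treat the result as an upper bound.

```python
def worst_case_demand(concurrent_queries, per_node_cap_gb, pool_gb):
    """Return (total demand in GB, demand as a multiple of the query pool)."""
    demand_gb = concurrent_queries * per_node_cap_gb
    return demand_gb, demand_gb / pool_gb


demand, ratio = worst_case_demand(concurrent_queries=30, per_node_cap_gb=12, pool_gb=19)
print(f"worst-case demand per node: {demand}GB, {ratio:.1f}x the pool")
```

Every query is within its own cap, yet the combined demand is many times what a node can offer, which is why a per-query cap alone cannot prevent contention.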

4. Resource Groups — Workload Isolation

Resource Groups classify incoming queries into groups and give each group concurrency, memory, and queue limits.

# etc/resource-groups.properties
resource-groups.configuration-manager=file
resource-groups.config-file=/etc/trino/resource-groups.json

{
  "rootGroups": [
    {
      "name": "global",
      "softMemoryLimit": "80%",
      "hardConcurrencyLimit": 100,
      "maxQueued": 1000,
      "schedulingPolicy": "weighted",
      "subGroups": [
        {
          "name": "batch",
          "softMemoryLimit": "60%",
          "hardConcurrencyLimit": 20,
          "maxQueued": 500,
          "schedulingWeight": 3
        },
        {
          "name": "adhoc",
          "softMemoryLimit": "30%",
          "hardConcurrencyLimit": 40,
          "maxQueued": 200,
          "schedulingWeight": 1
        },
        {
          "name": "dashboard",
          "softMemoryLimit": "20%",
          "hardConcurrencyLimit": 50,
          "maxQueued": 100,
          "schedulingWeight": 2
        }
      ]
    }
  ],
  "selectors": [
    { "user": "etl-svc",           "group": "global.batch" },
    { "source": "superset",        "group": "global.dashboard" },
    { "group": "global.adhoc" }
  ]
}

Key Properties

Property	Meaning
`hardConcurrencyLimit`	Maximum number of queries running concurrently
`maxQueued`	Maximum length of the wait queue; queries beyond it are rejected
`softMemoryLimit`	Memory this group may use (%, or an absolute value). When exceeded, new queries wait
`schedulingPolicy`	Policy for picking the next query from the queue
`schedulingWeight`	Resource-sharing weight between groups under the `weighted` policy
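One detail worth noticing in the example configuration is that the sub-group limits add up to more than the parent's. A short sketch makes this visible; it uses an abbreviated copy of the JSON above (limits only), and the function name is hypothetical.

```python
import json

# Abbreviated copy of the "global" group from resource-groups.json (limits only).
config = json.loads("""
{
  "name": "global", "softMemoryLimit": "80%", "hardConcurrencyLimit": 100,
  "subGroups": [
    {"name": "batch",     "softMemoryLimit": "60%", "hardConcurrencyLimit": 20},
    {"name": "adhoc",     "softMemoryLimit": "30%", "hardConcurrencyLimit": 40},
    {"name": "dashboard", "softMemoryLimit": "20%", "hardConcurrencyLimit": 50}
  ]
}
""")


def oversubscription(group):
    """Compare the sum of the sub-group limits with the parent's own limits."""
    subs = group["subGroups"]
    return {
        "sub_concurrency": sum(s["hardConcurrencyLimit"] for s in subs),
        "parent_concurrency": group["hardConcurrencyLimit"],
        "sub_memory_pct": sum(int(s["softMemoryLimit"].rstrip("%")) for s in subs),
        "parent_memory_pct": int(group["softMemoryLimit"].rstrip("%")),
    }


report = oversubscription(config)
print(report)
```

The sub-groups sum to 110 concurrent queries and 110% memory against a parent of 100 and 80%. Because a query must also fit within its parent group's limits, the three sub-groups cannot all reach their own ceilings at the same time. Whether that degree of oversubscription is desirable is a design choice for each cluster.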

Selectors — Matching Queries to Groups

selectors assign a query to the first matching rule, evaluated top to bottom. You can classify by user, source (the source name sent by the client), clientTags, queryType, and more.


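With this in place, no matter how many ad-hoc analyst queries pile up, the adhoc group never runs more than 40 at once, and new ad-hoc queries queue once the group passes its soft memory limit. The batch group keeps its own concurrency slots and memory share. This is the key mechanism for protecting SLA-bound workloads.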
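The first-match behavior of selectors can be modeled in a few lines. This is a simplified sketch rather than Trino's implementation: it covers only the `user` and `source` fields and assumes each pattern must match the whole value, which is why it uses `re.fullmatch`. The function name and the sample users and sources are hypothetical.

```python
import re

# Same rules, in the same order, as the "selectors" array above.
SELECTORS = [
    {"user": "etl-svc", "group": "global.batch"},
    {"source": "superset", "group": "global.dashboard"},
    {"group": "global.adhoc"},  # no conditions: catch-all
]


def select_group(user, source, selectors=SELECTORS):
    """Return the group of the first rule whose conditions all match."""
    for rule in selectors:
        user_ok = "user" not in rule or re.fullmatch(rule["user"], user)
        source_ok = "source" not in rule or re.fullmatch(rule["source"], source or "")
        if user_ok and source_ok:
            return rule["group"]
    return None  # only reachable when there is no catch-all rule


print(select_group("etl-svc", "airflow"))    # global.batch
print(select_group("alice", "superset"))     # global.dashboard
print(select_group("alice", "trino-cli"))    # global.adhoc
print(select_group("etl-svc", "superset"))   # global.batch (first match wins)
```

Order matters: a query from etl-svc submitted through Superset lands in batch, not dashboard, because the user rule comes first. Keep the catch-all rule last so every query gets a group.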

5. Choosing a Scheduling Policy

Policy	Behavior	Best for
`fair`	Treats queries within a group fairly (FIFO-based)	A sensible default
`weighted`	Distributes resources to subgroups by `schedulingWeight` ratio	Differentiated priority between groups
`weighted_fair`	Compromise between weights and fairness	Balancing many groups
`query_priority`	Prioritizes by each query's `query_priority` session value	Dynamic priority control

The example above uses weighted, so under contention the next query to start is drawn from batch, dashboard, and adhoc at a ratio of roughly 3:2:1.
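The weights translate into long-run shares as follows. This is a simple proportional calculation that assumes all three groups always have queries waiting; it does not model Trino's scheduler, and the function name is hypothetical.

```python
from fractions import Fraction


def weighted_shares(weights):
    """Turn schedulingWeight values into each group's fraction of the total."""
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


shares = weighted_shares({"batch": 3, "dashboard": 2, "adhoc": 1})
for name, share in shares.items():
    print(f"{name}: {share} ({float(share):.0%})")
```

With weights 3, 2, and 1, batch gets half of the query starts, dashboard a third, and adhoc a sixth. When a group has nothing queued, its share is available to the others.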

6. Query-Level Controls — Extra Guardrails

Beyond Resource Groups, global and session settings help stop runaway queries.

# etc/config.properties — block abnormally heavy/slow queries
query.max-execution-time=2h
query.max-scan-physical-bytes=5TB
query.max-stage-count=150

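-- Per-session adjustment (for a specific query only)
-- query_max_memory is the cluster-wide cap for this session; the effective value cannot exceed query.max-memory
SET SESSION query_max_memory = '100GB';
SET SESSION query_max_run_time = '30m';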

query.max-scan-physical-bytes is a handy guardrail that kills a query once it scans past a given byte threshold — useful when someone accidentally launches a giant full scan.

7. Diagnostics — What to Look at When Tuning

7.1 System Tables

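-- Currently running/queued queries and the resource group each one landed in
-- (per-query memory is not a column of this table; read it from the Web UI)
SELECT query_id, state, "user", source,
       resource_group_id, queued_time_ms, query
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
ORDER BY created;
 
-- Per-resource-group status: running vs. queued counts
SELECT resource_group_id, state, count(*) AS queries
FROM system.runtime.queries
WHERE state IN ('RUNNING', 'QUEUED')
GROUP BY resource_group_id, state
ORDER BY resource_group_id, state;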

7.2 EXPLAIN ANALYZE

EXPLAIN ANALYZE actually executes the query, then reports time, memory, and row counts per stage.

EXPLAIN ANALYZE
SELECT user_id, count(*)
FROM iceberg.analytics.events
WHERE event_time >= TIMESTAMP '2026-06-01 00:00:00 UTC'
GROUP BY user_id;

This is where bad join orders or a single stage monopolizing memory show up. If the join order looks poor, refresh the table statistics so the cost-based optimizer has accurate inputs.

ANALYZE iceberg.analytics.events;

7.3 Web UI

The Query Detail page in the coordinator Web UI lets you visually inspect per-stage memory, execution time, and whether spill occurred. Queries whose "Peak Memory" approaches the cap are your top tuning candidates.
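Once peak-memory figures have been collected, by whatever means, ranking them against the cap is straightforward. The sketch below is illustrative: the record fields `query_id` and `peak_gb`, the sample values, and the 80% threshold are all assumptions, not Trino output.

```python
def tuning_candidates(queries, cap_gb, threshold=0.8):
    """Return (query_id, fraction of cap) for queries at or above the threshold, worst first."""
    flagged = [
        (q["query_id"], q["peak_gb"] / cap_gb)
        for q in queries
        if q["peak_gb"] >= threshold * cap_gb
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-node peak memory figures, compared with a 12GB per-node cap.
sample = [
    {"query_id": "q1", "peak_gb": 11.5},
    {"query_id": "q2", "peak_gb": 3.0},
    {"query_id": "q3", "peak_gb": 9.9},
]
for query_id, fraction in tuning_candidates(sample, cap_gb=12):
    print(f"{query_id}: {fraction:.0%} of the cap")
```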

8. Quick Symptom-to-Fix Table

Symptom	Likely cause	Fix
`EXCEEDED_LOCAL_MEMORY_LIMIT`	Query exceeded the cap on one node	Enable spill, raise `query.max-memory-per-node`, optimize the query
`EXCEEDED_GLOBAL_MEMORY_LIMIT`	Cluster-wide cap exceeded	Review `query.max-memory`, add workers, split the query
Worker OOMKill / GC storms	Heap ≈ container limit, insufficient headroom	Raise headroom, ensure heap < limit
Batches starved by ad-hoc queries	No workload isolation	Separate batch/adhoc with Resource Groups
Dashboard storms degrade everything	Unbounded dashboard concurrency	Concurrency and queue limits on the dashboard group
Accidental giant full scan	No guardrails	Set `query.max-scan-physical-bytes`
Query runs but is slow	Bad join order, missing statistics	Diagnose with `ANALYZE` and `EXPLAIN ANALYZE`

9. Tuning Checklist

JVM heap < container/node memory (20-30% headroom)
query.max-memory-per-node < (heap − headroom)
Spill enabled + multiple paths on fast local disks
batch/adhoc/dashboard isolated with Resource Groups
Selectors map groups by user and source
Global guardrails such as query.max-scan-physical-bytes
Regular ANALYZE on key tables
Review top peak-memory queries via the Web UI / system tables
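The first checklist item can be turned into a quick calculation. This sketch reads "20-30% headroom" as the share of the container limit left outside the JVM heap; the function name and the 32GB container size are assumptions for illustration.

```python
def heap_range_gb(container_limit_gb, min_headroom=0.20, max_headroom=0.30):
    """Heap sizes that leave 20-30% of the container limit outside the JVM heap."""
    low = container_limit_gb * (1 - max_headroom)
    high = container_limit_gb * (1 - min_headroom)
    return low, high


low, high = heap_range_gb(32)
print(f"-Xmx between {low:.1f}GB and {high:.1f}GB for a 32GB container")
```

Under these assumptions, the 24GB heap used in the earlier example fits a 32GB container with room to spare for off-heap memory and the operating system.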

10. Summary

Layer	Tool	What it prevents
Single-query memory	`query.max-memory(-per-node)`	One query monopolizing the cluster
Completion guarantee	spill-to-disk	Failures caused by OOM
Workload contention	Resource Groups	Groups encroaching on each other, SLA violations
Runaway incidents	Global query limits	Accidental giant scans, unbounded execution
Execution quality	`ANALYZE` + `EXPLAIN ANALYZE`	Bad join orders, slow queries

Stabilizing Trino comes down to two things. First, understand the memory model so heap, headroom, and query caps are set consistently, with spill as the safety net. Second, isolate workloads with Resource Groups so no single query or team can paralyze the whole cluster. With both in place, your Trino cluster will "slow down but never stop" even under heavy load.

This article is based on the Trino 440 series. If you need help with memory and concurrency tuning or workload isolation design for your Trino cluster, feel free to reach out.

— Data Dynamics Engineering Team