Deploying Trino on Kubernetes — Helm, Autoscaling, and Graceful Shutdown
A hands-on guide to running a Trino cluster on Kubernetes with the official Helm chart. Covers coordinator/worker configuration, catalog injection, resource and HPA autoscaling, graceful shutdown, the exchange manager for FTE, and running on spot instances.
Trino is a nearly stateless distributed engine made up of one coordinator and N workers. That property is a great fit for Kubernetes — workers are easy to scale up and down as Pods, can be autoscaled based on load, and can run on spot instances to cut costs. The catch is that you have to deal with classic distributed-systems questions like "what happens if a worker disappears mid-query?"
This post walks through deploying Trino on Kubernetes reliably using the official Helm chart, including operational concerns like autoscaling and graceful shutdown.
1. Architecture — Trino on Kubernetes
    Ingress / LoadBalancer (TLS termination)
                        │
                        ▼
              ┌───────────────────┐
              │  Coordinator Pod  │  (Deployment, replica 1)
              │   Service :8080   │
              └───────────────────┘
                        │ discovery + internal communication
    ┌───────────────────┼───────────────────┐
    ▼                   ▼                   ▼
Worker Pod          Worker Pod          Worker Pod  (Deployment, replica N)
    └──── HPA adjusts replicas based on load ────┘

Catalogs/config: injected via ConfigMap · Secret
FTE spool: S3/GCS (exchange manager)

The key mapping:
| Trino concept | Kubernetes resource |
|---|---|
| Coordinator | Deployment (replica 1) + Service |
| Workers | Deployment (replica N) |
| Node discovery | Coordinator Service DNS |
| Catalogs/config | ConfigMap |
| Passwords & secrets | Secret |
| Autoscaling | HPA (or KEDA) |
| FTE spool storage | External object storage |
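To make the node-discovery row concrete, here is a minimal sketch that builds the address a worker would use to reach the coordinator through its Service DNS name. The Service name `my-trino`, the namespace `trino`, and the default cluster domain `cluster.local` are assumptions for illustration; the chart renders the real `discovery.uri` for you, so this only shows the shape of the value.

```python
def discovery_uri(service: str, namespace: str, port: int = 8080,
                  cluster_domain: str = "cluster.local") -> str:
    """Return the in-cluster URL workers use to find the coordinator."""
    # Kubernetes Service DNS names follow <service>.<namespace>.svc.<cluster-domain>.
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"http://{host}:{port}"


# Hypothetical names: a coordinator Service "my-trino" in namespace "trino".
print("discovery.uri=" + discovery_uri("my-trino", "trino"))
```

Because workers only know this stable DNS name, the coordinator Pod can be rescheduled without reconfiguring them.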
2. Getting Started with the Official Helm Chart
Trino ships an official Helm chart (trinodb/charts).
helm repo add trino https://trinodb.github.io/charts
helm repo update
# Default installation
helm install my-trino trino/trino --namespace trino --create-namespace

The defaults bring up 1 coordinator + 2 workers, but for production you should author your own values.yaml.
3. values.yaml — Core Configuration
image:
  tag: "version-pin"   # Always pin the version. Never use latest
server:
  workers: 4
  config:
    query:
      maxMemoryPerNode: "12GB"
  # Shared secret (internal communication): inject via a Secret in practice
  coordinatorExtraConfig: |
    query.max-memory=80GB
  exchangeManager:
    name: filesystem   # For FTE. See section 7 below
coordinator:
  jvm:
    maxHeapSize: "16G"
  resources:
    requests:
      cpu: 4
      memory: 18Gi
    limits:
      memory: 18Gi
worker:
  jvm:
    maxHeapSize: "24G"
  resources:
    requests:
      cpu: 8
      memory: 28Gi
    limits:
      memory: 28Gi
  # Graceful shutdown so workers don't die mid-query (see section 6)
  terminationGracePeriodSeconds: 120
additionalCatalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest-catalog.uri=http://iceberg-rest:8181
    fs.native-s3.enabled=true
    s3.endpoint=https://s3.example.com
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://pg:5432/crm
    connection-user=trino
    connection-password=${ENV:PG_PASSWORD}

How JVM Heap Relates to Container Memory
The most common mistake is setting the JVM heap equal to the container memory limit. Trino uses native memory, metaspace, and OS buffers on top of the heap. Leave headroom so the container doesn't get OOMKilled.
Container memory limit ≈ JVM maxHeapSize + headroom (typically 20–30%)
e.g. limit 28Gi → maxHeapSize 24G (about 4Gi headroom)

Additionally, query.max-memory-per-node plus memory.heap-headroom-per-node must fit within the JVM heap. The headroom setting reserves part of the heap (30% by default) for allocations that Trino does not track.
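The two rules above can be checked with a few lines of arithmetic. The sketch below is a rough sizing helper, not anything the chart ships: the class and field names are made up for illustration, it treats GB and Gi as the same unit, and it assumes the default heap headroom of 30% applies.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    container_limit_gib: float            # Kubernetes resources.limits.memory
    jvm_heap_gib: float                   # jvm.maxHeapSize
    query_max_per_node_gib: float         # query.max-memory-per-node
    heap_headroom_fraction: float = 0.30  # assumed default for memory.heap-headroom-per-node

    def container_headroom_gib(self) -> float:
        """Memory left outside the heap for native buffers, metaspace, and so on."""
        return self.container_limit_gib - self.jvm_heap_gib

    def heap_budget_ok(self) -> bool:
        """Query memory plus heap headroom has to fit inside the heap."""
        headroom = self.jvm_heap_gib * self.heap_headroom_fraction
        return self.query_max_per_node_gib + headroom <= self.jvm_heap_gib


worker = NodeMemory(container_limit_gib=28, jvm_heap_gib=24, query_max_per_node_gib=12)
print(worker.container_headroom_gib(), worker.heap_budget_ok())

small_heap = NodeMemory(container_limit_gib=18, jvm_heap_gib=16, query_max_per_node_gib=12)
print(small_heap.container_headroom_gib(), small_heap.heap_budget_ok())
```

With a 24G heap, 12GB of query memory plus 7.2GB of headroom fits. With a 16G heap the same 12GB limit would not (12 + 4.8 > 16), so if one per-node limit is rendered for both node types, check it against the smaller heap as well.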
4. Injecting Catalogs and Secrets
Catalogs go into ConfigMaps (the additionalCatalogs above); passwords are kept separate in Secrets.
# Inject the Secret into workers/coordinator as environment variables (values.yaml)
worker:
  envFrom:
    - secretRef:
        name: trino-secrets
coordinator:
  envFrom:
    - secretRef:
        name: trino-secrets

kubectl create secret generic trino-secrets -n trino \
  --from-literal=PG_PASSWORD='***' \
  --from-literal=LDAP_BIND_PASSWORD='***'

Referencing ${ENV:PG_PASSWORD} inside the catalog properties means no secrets end up in the ConfigMap that gets committed to Git.
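To see why the ConfigMap stays free of secrets, it helps to picture the substitution happening only inside the running Pod. The following is a small Python imitation of that step, not Trino's own code: the function name and the sample password are hypothetical, and real values come from the Secret-backed environment variables.

```python
import re
from typing import Mapping

# Matches placeholders of the form ${ENV:NAME}.
_ENV_REF = re.compile(r"\$\{ENV:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(properties_text: str, env: Mapping[str, str]) -> str:
    """Replace each ${ENV:NAME} placeholder with the value found in env."""
    def lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(lookup, properties_text)


# What lives in Git and in the ConfigMap: only the placeholder.
catalog = "connection-user=trino\nconnection-password=${ENV:PG_PASSWORD}\n"

# What the process sees after substitution (sample value, for illustration only).
resolved = resolve_env_refs(catalog, {"PG_PASSWORD": "s3cret"})
print(resolved)
```

A missing variable raises an error here on purpose: failing at startup is easier to debug than a catalog that connects with an empty password.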
5. Autoscaling — Matching Workers to Load
Workers are close to stateless, so horizontal scaling comes naturally.
5.1 HPA (CPU-based)
# values.yaml
worker:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

When average CPU utilization across the worker Pods exceeds 70% of their CPU requests, the HPA adds workers. This is simple but limited: Trino queries are highly bursty, so CPU can spike and settle before scaling has a chance to catch up.
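The replica count the HPA aims for follows the formula in the Kubernetes documentation: the current replica count scaled by the ratio of observed to target utilization, rounded up. A minimal sketch, assuming the default 10% tolerance and ignoring stabilization windows and pod readiness:

```python
import math


def hpa_desired_replicas(current_replicas: int, current_cpu_pct: float,
                         target_cpu_pct: float, min_replicas: int,
                         max_replicas: int, tolerance: float = 0.10) -> int:
    """Approximate the HPA rule: ceil(current * observed / target), clamped."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 workers averaging 95% CPU against a 70% target, bounds from the values above.
print(hpa_desired_replicas(4, 95, 70, min_replicas=3, max_replicas=20))
```

Four workers at 95% CPU become six. Because the input is an average sampled periodically, a burst that ends before the next evaluation never shows up in the result.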
5.2 KEDA (Query/Schedule-based)
KEDA allows smarter scaling. You can scale on a Prometheus metric such as the number of queued queries, or configure schedule-based scaling that only adds workers during business hours.
# KEDA ScaledObject example (conceptual)
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: trino_queued_queries   # number of queued queries
      threshold: "5"
  - type: cron                      # guaranteed minimum during business hours
    metadata:
      timezone: Etc/UTC             # the cron scaler needs a timezone; use your own
      start: "0 8 * * 1-5"
      end: "0 20 * * 1-5"
      desiredReplicas: "8"

Scaling out is easy; scaling in is the dangerous part. If a worker being removed is still executing a query, that query fails. This is why graceful shutdown is essential.
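When several triggers are active, the largest proposal wins. The sketch below approximates how the two triggers above would combine: the queue trigger asks for roughly one worker per `threshold` queued queries, and the cron trigger asks for a fixed count inside its window. The function names are hypothetical, the min/max bounds are borrowed from the HPA example, and tolerance and cooldown behavior are left out.

```python
import math
from datetime import datetime


def queue_replicas(queued_queries: float, threshold: float) -> int:
    """One worker per `threshold` queued queries, rounded up."""
    return math.ceil(queued_queries / threshold)


def in_business_hours(now: datetime) -> bool:
    # Mirrors the cron pair "0 8 * * 1-5" to "0 20 * * 1-5": Mon-Fri, 08:00 to 20:00.
    return now.weekday() < 5 and 8 <= now.hour < 20


def desired_workers(queued_queries: float, now: datetime, threshold: float = 5,
                    cron_replicas: int = 8, min_replicas: int = 3,
                    max_replicas: int = 20) -> int:
    """Take the largest proposal from the active triggers, then clamp it."""
    candidates = [queue_replicas(queued_queries, threshold)]
    if in_business_hours(now):
        candidates.append(cron_replicas)
    return max(min_replicas, min(max_replicas, max(candidates)))


monday_morning = datetime(2024, 1, 8, 10, 0)
sunday_morning = datetime(2024, 1, 7, 10, 0)
print(desired_workers(12, monday_morning))  # cron floor applies
print(desired_workers(60, monday_morning))  # queue depth dominates
print(desired_workers(0, sunday_morning))   # falls back to the minimum
```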
6. Graceful Shutdown — Removing Workers Without Killing Queries
Trino workers support graceful shutdown. When shutdown is requested through the worker's REST API (a PUT of "SHUTTING_DOWN" to /v1/info/state), the worker enters the SHUTTING_DOWN state, stops accepting new tasks, and waits for in-flight work to finish before exiting.
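For reference, the shutdown request is a plain HTTP call against the worker. The sketch below only builds the request so its shape is visible; the endpoint and body follow Trino's graceful shutdown documentation, while the worker URL and the `X-Trino-User` value are placeholders, and whether that header is required depends on your access control setup.

```python
import json
import urllib.request


def build_shutdown_request(worker_url: str, user: str = "admin") -> urllib.request.Request:
    """Build (but do not send) the PUT that asks a worker to drain and exit."""
    return urllib.request.Request(
        url=f"{worker_url.rstrip('/')}/v1/info/state",
        # The body is the JSON string "SHUTTING_DOWN", quotes included.
        data=json.dumps("SHUTTING_DOWN").encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "X-Trino-User": user},
    )


req = build_shutdown_request("http://localhost:8080")  # hypothetical worker address
print(req.get_method(), req.full_url, req.data)
# To actually send it from inside the Pod:
# urllib.request.urlopen(req, timeout=10)
```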
The Helm chart's worker Pods implement this with a preStop hook and terminationGracePeriodSeconds. The key is to give a generous grace period.
worker:
  # Time for in-flight queries to finish. Set it longer than your workload's longest query
  terminationGracePeriodSeconds: 300
  gracefulShutdown:
    enabled: true
    gracePeriodSeconds: 120

The sequence of events:
1. K8s marks the Pod as terminating and runs the preStop hook; SIGTERM follows only after the hook returns
2. Worker transitions to SHUTTING_DOWN → coordinator stops sending new tasks
3. Wait for in-flight tasks to complete
4. Terminate within terminationGracePeriodSeconds

If the grace period is shorter than your query duration, K8s force-kills the Pod with SIGKILL and the query breaks. If you run long ETL jobs, set a generous grace period, or combine it with FTE from section 7.
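One detail is easy to miss when picking these numbers. Trino's documented shutdown sequence sleeps for the shutdown grace period, drains active tasks, and then sleeps for the grace period again, so the time actually available for in-flight tasks is less than terminationGracePeriodSeconds. A rough budget, assuming the chart's gracePeriodSeconds maps to Trino's shutdown.grace-period (the function names are made up for illustration):

```python
def task_drain_budget(termination_grace_s: int, shutdown_grace_s: int) -> int:
    """Seconds left for in-flight tasks before Kubernetes sends SIGKILL."""
    # Two grace-period sleeps bracket the drain: one before it and one after.
    return termination_grace_s - 2 * shutdown_grace_s


def fits(termination_grace_s: int, shutdown_grace_s: int, longest_task_s: int) -> bool:
    """True if the longest expected task can finish inside the drain budget."""
    return task_drain_budget(termination_grace_s, shutdown_grace_s) >= longest_task_s


print(task_drain_budget(300, 120))  # budget for the values shown above
print(fits(300, 120, 45))
print(fits(300, 120, 600))
```

With 300 and 120 seconds, only about a minute remains for tasks to drain. For longer work, either raise terminationGracePeriodSeconds or lower gracePeriodSeconds.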
7. Fault-tolerant Execution — Running Confidently on Spot Instances
Graceful shutdown handles "announced" terminations, but it cannot protect against sudden worker loss like a spot instance reclaim. That's where Fault-tolerant Execution (FTE) comes in. Intermediate results are spooled to external storage (the exchange manager), and when a worker dies, only the affected tasks are retried on other workers.
# values.yaml
server:
  config:
    retryPolicy: "TASK"   # or QUERY
  exchangeManager:
    name: filesystem
    baseDir: "s3://trino-exchange/spool"

# Generated as exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool

| retryPolicy | Behavior | Best for |
|---|---|---|
| QUERY | Retries the entire query | Short interactive queries |
| TASK | Retries only failed tasks | Long-running ETL batches |
With FTE enabled you can run workers 100% on spot while keeping batch jobs reliable — a substantial cost saving. The trade-off is spool I/O overhead, so it's better left off for clusters dedicated to ultra-low-latency interactive queries.
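The difference between the two retry policies is easiest to see as repeated work. The following is a toy accounting model, not a description of Trino's scheduler: it assumes every task had already run to completion when the worker was lost, and it ignores spool I/O and partial progress.

```python
from typing import Sequence, Set


def rework_seconds(task_seconds: Sequence[float], failed: Set[int],
                   retry_policy: str) -> float:
    """Seconds of task work that must be repeated after some tasks are lost."""
    if retry_policy == "QUERY":
        # Any failure restarts the whole query, so every task runs again.
        return float(sum(task_seconds)) if failed else 0.0
    if retry_policy == "TASK":
        # Finished tasks are read back from the spool; only the lost ones rerun.
        return float(sum(task_seconds[i] for i in failed))
    raise ValueError(f"unknown retry policy: {retry_policy}")


# Hypothetical batch query: four tasks of 10 minutes each, one lost to a spot reclaim.
tasks = [600, 600, 600, 600]
print(rework_seconds(tasks, {2}, "QUERY"))
print(rework_seconds(tasks, {2}, "TASK"))
```

For a query made of a few short tasks the two numbers are close, which is why QUERY is a reasonable fit for interactive workloads and TASK pays off on long batches.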
8. Scheduling — Node Placement and Stability
# Coordinator on stable on-demand nodes, workers on spot
coordinator:
  nodeSelector:
    node-pool: on-demand
worker:
  nodeSelector:
    node-pool: spot
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  # Spread workers across nodes to reduce the risk of losing many at once
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway

Principles:
- Never put the coordinator on spot. If the coordinator dies, the entire cluster stops (it's a SPOF). Guarantee it an on-demand node with sufficient resources.
- Workers on spot + FTE is the cost-efficient combination.
- Use topologySpreadConstraints to spread workers across nodes/zones, mitigating scenarios where many disappear simultaneously.
9. Observability — Prometheus and Health Checks
Trino exposes JMX metrics. Collect them into Prometheus via a JMX exporter sidecar or the built-in metrics endpoint.
serviceMonitor:   # When using the Prometheus Operator
  enabled: true
  labels:
    release: prometheus

Key metrics to monitor:
| Metric | Meaning |
|---|---|
| trino_running_queries / trino_queued_queries | Running/queued query counts (scaling trigger) |
| Cluster memory utilization | OOM risk |
| Worker node count | Discovery health |
| Failed query rate | Stability |
| GC pause | JVM heap pressure |
Use the coordinator's /v1/info endpoint for readiness, and configure the readiness probe so no traffic is sent to nodes still in the starting state.
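The /v1/info response is a small JSON document that includes a boolean `starting` field, which is what a readiness check can key on. A minimal sketch of that decision, with a hypothetical function name and a trimmed sample payload (a real response carries more fields):

```python
import json


def is_ready(info_body: str) -> bool:
    """Treat the node as ready only when /v1/info reports starting == false."""
    info = json.loads(info_body)
    # A missing field is treated as not ready, to stay on the safe side.
    return info.get("starting") is False


# Trimmed sample payloads for illustration.
print(is_ready('{"coordinator": true, "starting": false}'))
print(is_ready('{"coordinator": true, "starting": true}'))
```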
10. Deployment Checklist
- Pin the image tag (never latest)
- Coordinator on on-demand, workers on a separate spot node pool
- JVM heap < container limit (20–30% headroom)
- Catalogs in ConfigMaps, secrets in Secrets
- HPA/KEDA autoscaling configured
- Graceful shutdown grace period ≥ longest query duration
- Long-running batches use FTE (retryPolicy=TASK) + exchange manager
- Shared secret for internal communication + TLS
- Prometheus metrics and readiness probes configured
- PodDisruptionBudget to limit concurrent scale-down
11. Summary
| Operational challenge | Solution on Kubernetes |
|---|---|
| Load fluctuation | HPA (CPU) or KEDA (queries/schedule) |
| Announced worker termination | Graceful shutdown + a generous grace period |
| Sudden worker loss (spot) | FTE (retryPolicy=TASK) + exchange manager |
| Coordinator SPOF | On-demand node + guaranteed resources |
| Secret management | Separate ConfigMap/Secret, env-var substitution |
| Many workers lost at once | topologySpread + PodDisruptionBudget |
Trino's stateless architecture pairs well with Kubernetes, but the key to production stability is handling the reality that "a worker can disappear in the middle of a distributed query" with two mechanisms: graceful shutdown and FTE. Add the spot instance + FTE combination on top, and you can dramatically cut compute costs without sacrificing reliability.
This article is based on the official Trino Helm chart and the Trino 440-series releases.
— The Data Dynamics Engineering Team