
Deploying Trino on Kubernetes — Helm, Autoscaling, and Graceful Shutdown

A hands-on guide to running a Trino cluster on Kubernetes with the official Helm chart. Covers coordinator/worker configuration, catalog injection, resource and HPA autoscaling, graceful shutdown, the exchange manager for FTE, and running on spot instances.

Data Dynamics · June 5, 2026 · 8 min read

Trino is a nearly stateless distributed engine made up of one coordinator and N workers. That property is a great fit for Kubernetes — workers are easy to scale up and down as Pods, can be autoscaled based on load, and can run on spot instances to cut costs. The catch is that you have to deal with classic distributed-systems questions like "what happens if a worker disappears mid-query?"

This post walks through deploying Trino on Kubernetes reliably using the official Helm chart, including operational concerns like autoscaling and graceful shutdown.

1. Architecture — Trino on Kubernetes


The key mapping:

Trino concept	Kubernetes resource
Coordinator	Deployment (replica 1) + Service
Workers	Deployment (replica N)
Node discovery	Coordinator Service DNS
Catalogs/config	ConfigMap
Passwords & secrets	Secret
Autoscaling	HPA (or KEDA)
FTE spool storage	External object storage

2. Getting Started with the Official Helm Chart

Trino ships an official Helm chart (trinodb/charts).

helm repo add trino https://trinodb.github.io/charts
helm repo update
 
# Default installation
helm install my-trino trino/trino --namespace trino --create-namespace

The defaults bring up 1 coordinator + 2 workers, but for production you should author your own values.yaml.
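
One way to start that file is to dump the chart's defaults and install from the edited copy. The sketch below wraps both steps in shell functions. The release name and namespace match the commands above; the function names, the values.yaml path, and the chart version argument are illustrative assumptions.

```shell
# Sketch only: helper names and file paths are illustrative, not part of the chart.

# Write the chart's default values to a file you can edit and commit.
dump_default_values() {
  local out_file="${1:-values.yaml}"
  helm show values trino/trino > "$out_file"
}

# Install or upgrade the release from your values file, pinning the chart version.
# List available chart versions with: helm search repo trino/trino --versions
deploy_trino() {
  local chart_version="$1"
  local values_file="${2:-values.yaml}"
  helm upgrade --install my-trino trino/trino \
    --namespace trino --create-namespace \
    --version "$chart_version" \
    --values "$values_file"
}
```

A typical flow would be to run dump_default_values values.yaml, edit the file, and then call deploy_trino with the chart version you have tested.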

3. values.yaml — Core Configuration

image:
  tag: "version-pin"   # Always pin the version. Never use latest
 
server:
  workers: 4
  config:
    query:
      maxMemoryPerNode: "12GB"
    # The shared secret for internal communication is not shown here; inject it via a Secret in practice
  coordinatorExtraConfig: |
    query.max-memory=80GB
  exchangeManager:
    name: filesystem        # For FTE. See section 7 below
 
coordinator:
  jvm:
    maxHeapSize: "16G"
  resources:
    requests:
      cpu: 4
      memory: 18Gi
    limits:
      memory: 18Gi
 
worker:
  jvm:
    maxHeapSize: "24G"
  resources:
    requests:
      cpu: 8
      memory: 28Gi
    limits:
      memory: 28Gi
  # Graceful shutdown so workers don't die mid-query (see section 6 for how to size this)
  terminationGracePeriodSeconds: 300
 
additionalCatalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest-catalog.uri=http://iceberg-rest:8181
    fs.native-s3.enabled=true
    s3.endpoint=https://s3.example.com
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://pg:5432/crm
    connection-user=trino
    connection-password=${ENV:PG_PASSWORD}

How JVM Heap Relates to Container Memory

The most common mistake is setting the JVM heap equal to the container memory limit. Trino uses native memory, metaspace, and OS buffers on top of the heap. Leave headroom so the container doesn't get OOMKilled.

Container memory limit  ≈  JVM maxHeapSize  +  headroom (typically 20–30% of the heap)
e.g. limit 28Gi  →  maxHeapSize 24G  (4Gi headroom, about 17% of the heap, so slightly under that guideline)

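Additionally, query.max-memory-per-node must be smaller than the JVM heap. memory.heap-headroom-per-node reserves a further portion of the heap (30% by default) for allocations Trino does not track, and the two values together must fit within the heap.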
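
As a rough sanity check of those relationships, the arithmetic can be scripted. The function below is a sketch, not something Trino or the chart ships: its name and arguments are made up for illustration, sizes are whole GiB, and the 30% default is Trino's default for memory.heap-headroom-per-node.

```shell
# check_trino_memory LIMIT_GIB HEAP_GIB QUERY_MEM_GIB [HEAP_HEADROOM_PCT]
# Hypothetical helper: returns 0 if the sizing is consistent, 1 otherwise.
check_trino_memory() {
  local limit_gib="$1"
  local heap_gib="$2"
  local query_gib="$3"
  local headroom_pct="${4:-30}"

  # Work in MiB so the percentage does not lose precision to integer division.
  local heap_mib=$(( heap_gib * 1024 ))
  local query_mib=$(( query_gib * 1024 ))
  # Heap headroom reserved by Trino, rounded up.
  local heap_headroom_mib=$(( (heap_mib * headroom_pct + 99) / 100 ))

  if [ "$heap_gib" -ge "$limit_gib" ]; then
    echo "heap ${heap_gib}G leaves no container headroom under limit ${limit_gib}Gi"
    return 1
  fi
  if [ $(( query_mib + heap_headroom_mib )) -gt "$heap_mib" ]; then
    echo "query.max-memory-per-node plus heap headroom exceeds the heap"
    return 1
  fi
  echo "ok: $(( limit_gib - heap_gib ))Gi container headroom"
}
```

With the worker values from this section, check_trino_memory 28 24 12 passes: 12 GiB of query memory plus roughly 7.2 GiB of heap headroom fits inside the 24 GiB heap, and the heap sits 4 GiB below the container limit.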

4. Injecting Catalogs and Secrets

Catalogs go into ConfigMaps (the additionalCatalogs above); passwords are kept separate in Secrets.

# Inject the Secret into workers/coordinator as environment variables (values.yaml)
worker:
  envFrom:
    - secretRef:
        name: trino-secrets
coordinator:
  envFrom:
    - secretRef:
        name: trino-secrets

kubectl create secret generic trino-secrets -n trino \
  --from-literal=PG_PASSWORD='***' \
  --from-literal=LDAP_BIND_PASSWORD='***'

Referencing ${ENV:PG_PASSWORD} inside the catalog properties means no secrets end up in the ConfigMap that gets committed to Git.
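
A quick pre-commit check can catch a literal password before it reaches Git. The helper below is a hypothetical sketch (the function name is made up): it lists any property line containing "password" whose value is not an ${ENV:...} reference.

```shell
# find_inline_passwords FILE
# Hypothetical lint helper: prints password properties that hold a literal value
# instead of an ${ENV:...} reference. Exit status is 0 when an offending line is found.
find_inline_passwords() {
  grep -nE 'password[[:space:]]*=' "$1" | grep -vF '${ENV:'
}
```

For example, find_inline_passwords values.yaml && echo "literal password found" could run as part of a commit hook. It is a simple text match, so treat it as a safety net rather than a substitute for reviewing the file.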

5. Autoscaling — Matching Workers to Load

Workers are close to stateless, so horizontal scaling comes naturally.

5.1 HPA (CPU-based)

# values.yaml
worker:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

The HPA adds worker Pods when average CPU utilization across the workers, measured against each Pod's CPU request, exceeds 70%. Simple, but limited: Trino queries are highly bursty, so CPU can spike and settle before scaling has a chance to catch up.
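
It helps to look at the replica calculation itself. The HPA's documented rule is desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to the min/max bounds. The function below is a simplified sketch of that rule; the name is illustrative, and it ignores the controller's tolerance band and stabilization window.

```shell
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_CPU_PCT TARGET_CPU_PCT MIN MAX
# Simplified model of the HPA replica calculation; not the controller's actual code.
hpa_desired_replicas() {
  local current="$1"
  local cpu_pct="$2"
  local target_pct="$3"
  local min="$4"
  local max="$5"

  # Integer ceiling of current * cpu_pct / target_pct.
  local desired=$(( (current * cpu_pct + target_pct - 1) / target_pct ))

  # Clamp to the configured bounds.
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

With the settings above, four workers averaging 95% CPU give hpa_desired_replicas 4 95 70 3 20, which prints 6. The new Pods still have to be scheduled and started, by which time a short burst may already be over.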

5.2 KEDA (Query/Schedule-based)

Smarter scaling is possible with KEDA. You can scale on the number of queued (or running) queries, or configure schedule-based scaling that guarantees a minimum number of workers during business hours.

# KEDA ScaledObject example (conceptual)
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: trino_queued_queries        # number of queued queries
      threshold: "5"
  - type: cron                            # guaranteed minimum during business hours
    metadata:
      timezone: "UTC"                     # required by the cron scaler; use your own IANA zone
      start: "0 8 * * 1-5"
      end:   "0 20 * * 1-5"
      desiredReplicas: "8"

Scaling out is easy; scaling in is the dangerous part. If a worker being removed is still executing a query, that query fails. This is why graceful shutdown is essential.

6. Graceful Shutdown — Removing Workers Without Killing Queries

Trino workers support graceful shutdown. When a worker's state is set to SHUTTING_DOWN (an HTTP PUT to its /v1/info/state endpoint), it stops accepting new tasks and waits for in-flight work to finish before exiting.

The Helm chart's worker Pods implement this with a preStop hook that sends that request, combined with terminationGracePeriodSeconds. The key is to give a generous grace period.

worker:
  # Hard deadline before SIGKILL. Must cover both gracePeriodSeconds waits plus the time in-flight tasks need to finish
  terminationGracePeriodSeconds: 300
  gracefulShutdown:
    enabled: true
    gracePeriodSeconds: 120

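The sequence of events:

Kubernetes marks the Pod as terminating and runs the preStop hook, which puts the worker into SHUTTING_DOWN.
The worker waits gracePeriodSeconds so the coordinator notices and stops scheduling new tasks on it.
The worker finishes its in-flight tasks, waits gracePeriodSeconds once more, and exits.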

If terminationGracePeriodSeconds runs out before in-flight tasks finish, K8s force-kills the Pod with SIGKILL and the affected queries fail. If you run long ETL jobs, set a generous grace period, or combine it with FTE from section 7.
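
Two pieces of this can be sketched in shell. The first function sends the state change that Trino's graceful-shutdown documentation describes, a PUT of "SHUTTING_DOWN" to /v1/info/state. The host, port, and X-Trino-User value are assumptions that depend on your cluster's security setup. Trino's documented shutdown sequence waits the grace period twice, once before draining tasks and once after, so the second function computes how much of terminationGracePeriodSeconds is left for the drain itself. Both function names are illustrative.

```shell
# request_graceful_shutdown [HOST] [PORT] [USER]
# Asks a Trino worker to enter SHUTTING_DOWN. Host, port, and user are assumptions.
request_graceful_shutdown() {
  local host="${1:-localhost}"
  local port="${2:-8080}"
  local user="${3:-admin}"
  curl -fsS -X PUT \
    -H 'Content-Type: application/json' \
    -H "X-Trino-User: ${user}" \
    -d '"SHUTTING_DOWN"' \
    "http://${host}:${port}/v1/info/state"
}

# drain_budget_seconds TERMINATION_GRACE_SECONDS SHUTDOWN_GRACE_SECONDS
# Prints the seconds left for in-flight tasks after the two grace-period waits.
# Returns 1 if the termination grace period cannot even cover those waits.
drain_budget_seconds() {
  local termination="$1"
  local grace="$2"
  local budget=$(( termination - 2 * grace ))
  if [ "$budget" -lt 0 ]; then
    echo "terminationGracePeriodSeconds must be at least $(( 2 * grace ))"
    return 1
  fi
  echo "$budget"
}
```

With the values above, drain_budget_seconds 300 120 prints 60. Tasks that need more than about a minute to finish would still be cut off, which is a reason to raise terminationGracePeriodSeconds for long-running workloads.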

7. Fault-tolerant Execution — Running Confidently on Spot Instances

Graceful shutdown handles "announced" terminations, but it cannot protect against sudden worker loss like a spot instance reclaim. That's where Fault-tolerant Execution (FTE) comes in. Intermediate results are spooled to external storage (the exchange manager), and when a worker dies, only the affected tasks are retried on other workers.

# values.yaml
server:
  config:
    retryPolicy: "TASK"        # or QUERY
  exchangeManager:
    name: filesystem
    baseDir: "s3://trino-exchange/spool"

# Generated as exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool

retryPolicy	Behavior	Best for
`QUERY`	Retries the entire query	Short interactive queries
`TASK`	Retries only failed tasks	Long-running ETL batches

With FTE enabled you can run workers 100% on spot while keeping batch jobs reliable — a substantial cost saving. The trade-off is spool I/O overhead, so it's better left off for clusters dedicated to ultra-low-latency interactive queries.

8. Scheduling — Node Placement and Stability

# Coordinator on stable on-demand nodes, workers on spot
coordinator:
  nodeSelector:
    node-pool: on-demand
worker:
  nodeSelector:
    node-pool: spot
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  # Spread workers across nodes to reduce the risk of losing many at once
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway

Principles:

Never put the coordinator on spot. If the coordinator dies, the entire cluster stops (it's a SPOF). Guarantee it an on-demand node with sufficient resources.
Workers on spot + FTE is the cost-efficient combination.
Use topologySpreadConstraints to spread workers across nodes/zones, mitigating scenarios where many disappear simultaneously.
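
The summary table pairs topologySpread with a PodDisruptionBudget. For voluntary disruptions such as node drains and cluster upgrades, a PDB limits how many workers can be evicted at once. It does not protect against abrupt node loss. A minimal sketch using kubectl follows. The budget name and the label selector are assumptions, so check the labels on your worker Pods (kubectl get pods -n trino --show-labels) before using it.

```shell
# create_worker_pdb [MAX_UNAVAILABLE]
# Sketch: the PDB name and label selector are assumptions, not chart-defined values.
create_worker_pdb() {
  local max_unavailable="${1:-1}"
  kubectl create poddisruptionbudget trino-worker \
    --namespace trino \
    --selector 'app.kubernetes.io/component=worker' \
    --max-unavailable "$max_unavailable"
}
```

Calling create_worker_pdb with no argument allows one worker to be evicted at a time. A larger cluster can usually tolerate a higher number.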

9. Observability — Prometheus and Health Checks

Trino exposes JMX metrics. Collect them into Prometheus via a JMX exporter sidecar or the built-in metrics endpoint.

serviceMonitor:        # When using the Prometheus Operator
  enabled: true
  labels:
    release: prometheus

Key metrics to monitor:

Metric	Meaning
`trino_running_queries` / `trino_queued_queries`	Running/queued query counts (scaling trigger)
Cluster memory utilization	OOM risk
Worker node count	Discovery health
Failed query rate	Stability
GC pause	JVM heap pressure

Use the coordinator's /v1/info endpoint for readiness. Its JSON response includes a starting flag, so have the readiness probe check that flag rather than only the HTTP status, and no traffic is sent to nodes still in the starting state.
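
A minimal sketch of such a check: /v1/info returns a JSON document with a boolean starting field, and the helper below (an illustrative name) reads that document on stdin and succeeds only when starting is false. It uses grep rather than a JSON parser to stay dependency-free, which is adequate for this flat response but is not a general way to read JSON.

```shell
# trino_is_ready
# Reads a /v1/info JSON response on stdin.
# Exit status 0 means the node reports that it has finished starting.
trino_is_ready() {
  grep -Eq '"starting"[[:space:]]*:[[:space:]]*false'
}
```

Inside a Pod this could be used as curl -fsS http://localhost:8080/v1/info | trino_is_ready, where the host and port are assumptions about your deployment.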

10. Deployment Checklist

11. Summary

Operational challenge	Solution on Kubernetes
Load fluctuation	HPA (CPU) or KEDA (queries/schedule)
Announced worker termination	Graceful shutdown + a generous grace period
Sudden worker loss (spot)	FTE (retryPolicy=TASK) + exchange manager
Coordinator SPOF	On-demand node + guaranteed resources
Secret management	Separate ConfigMap/Secret, env-var substitution
Many workers lost at once	topologySpread + PodDisruptionBudget

Trino's stateless architecture pairs well with Kubernetes, but the key to production stability is handling the reality that "a worker can disappear in the middle of a distributed query" with two mechanisms: graceful shutdown and FTE. Add the spot instance + FTE combination on top, and you can dramatically cut compute costs without sacrificing reliability.

This article is based on the official Trino Helm chart and the Trino 440-series releases. If you need help with Trino deployment, autoscaling, or cost optimization on Kubernetes, feel free to reach out.

— The Data Dynamics Engineering Team