trino · kubernetes · helm · autoscaling · devops · data-platform

Deploying Trino on Kubernetes — Helm, Autoscaling, and Graceful Shutdown

A hands-on guide to running a Trino cluster on Kubernetes with the official Helm chart. Covers coordinator/worker configuration, catalog injection, resource and HPA autoscaling, graceful shutdown, the exchange manager for FTE, and running on spot instances.

Data Dynamics · June 5, 2026 · 8 min read

Trino is a nearly stateless distributed engine made up of one coordinator and N workers. That property is a great fit for Kubernetes — workers are easy to scale up and down as Pods, can be autoscaled based on load, and can run on spot instances to cut costs. The catch is that you have to deal with classic distributed-systems questions like "what happens if a worker disappears mid-query?"

This post walks through deploying Trino on Kubernetes reliably using the official Helm chart, including operational concerns like autoscaling and graceful shutdown.

1. Architecture — Trino on Kubernetes

                  Ingress / LoadBalancer (TLS termination)
                            │
                            ▼
                  ┌───────────────────┐
                  │  Coordinator Pod  │  (Deployment, replica 1)
                  │  Service :8080    │
                  └───────────────────┘
                            │  discovery + internal communication
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   Worker Pod          Worker Pod          Worker Pod   (Deployment, replica N)
        └──── HPA adjusts replicas based on load ────┘
 
   Catalogs/config: injected via ConfigMap · Secret
   FTE spool: S3/GCS (exchange manager)

The key mapping:

Trino concept         Kubernetes resource
Coordinator           Deployment (replica 1) + Service
Workers               Deployment (replica N)
Node discovery        Coordinator Service DNS
Catalogs/config       ConfigMap
Passwords & secrets   Secret
Autoscaling           HPA (or KEDA)
FTE spool storage     External object storage

2. Getting Started with the Official Helm Chart

Trino ships an official Helm chart (trinodb/charts).

helm repo add trino https://trinodb.github.io/charts
helm repo update
 
# Default installation
helm install my-trino trino/trino --namespace trino --create-namespace

The defaults bring up 1 coordinator + 2 workers, but for production you should author your own values.yaml.

3. values.yaml — Core Configuration

image:
  tag: "version-pin"   # Always pin the version. Never use latest
 
server:
  workers: 4
  config:
    query:
      maxMemory: "80GB"     # rendered as query.max-memory
    # Shared secret (internal communication): inject via a Secret in practice
  exchangeManager:
    name: filesystem        # For FTE. See section 7 below

coordinator:
  jvm:
    maxHeapSize: "16G"
  config:
    query:
      # Per-node limits live under coordinator/worker, not server.
      # 12GB plus the default 30% heap headroom would not fit a 16G heap.
      maxMemoryPerNode: "8GB"
  resources:
    requests:
      cpu: 4
      memory: 18Gi
    limits:
      memory: 18Gi

worker:
  jvm:
    maxHeapSize: "24G"
  config:
    query:
      maxMemoryPerNode: "12GB"
  resources:
    requests:
      cpu: 8
      memory: 28Gi
    limits:
      memory: 28Gi
  # Graceful shutdown so workers don't die mid-query (see section 6)
  terminationGracePeriodSeconds: 120
 
additionalCatalogs:
  iceberg: |
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest-catalog.uri=http://iceberg-rest:8181
    fs.native-s3.enabled=true
    s3.endpoint=https://s3.example.com
  postgresql: |
    connector.name=postgresql
    connection-url=jdbc:postgresql://pg:5432/crm
    connection-user=trino
    connection-password=${ENV:PG_PASSWORD}

How JVM Heap Relates to Container Memory

The most common mistake is setting the JVM heap equal to the container memory limit. Trino uses native memory, metaspace, and OS buffers on top of the heap. Leave headroom so the container doesn't get OOMKilled.

Container memory limit  ≈  JVM maxHeapSize  +  headroom (typically 20–30%)
e.g. limit 28Gi  →  maxHeapSize 24G  (4Gi headroom, about 14% of the limit; a 20% margin would need a heap of roughly 22G)

Additionally, query.max-memory-per-node plus memory.heap-headroom-per-node must fit within the JVM heap. The headroom setting reserves part of the heap for allocations Trino does not track, and it defaults to 30% of the heap.
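As a rough illustration of the two sizing rules above, the following Python sketch checks one node's numbers. The function name is hypothetical, GB and GiB are treated as equal for simplicity, and the 0.3 default mirrors Trino's documented default for memory.heap-headroom-per-node.

```python
def check_node_memory(limit_gib, heap_gib, max_memory_per_node_gib,
                      heap_headroom_fraction=0.3):
    """Return sizing facts for one Trino node (illustrative helper only)."""
    container_headroom = limit_gib - heap_gib
    heap_headroom = heap_gib * heap_headroom_fraction
    return {
        # Memory left for metaspace, native buffers, and the OS
        "container_headroom_gib": container_headroom,
        "container_headroom_pct": round(100 * container_headroom / limit_gib, 1),
        # Query memory and heap headroom must both fit inside the heap
        "fits_in_heap": max_memory_per_node_gib + heap_headroom <= heap_gib,
    }


# Worker values from the values.yaml in section 3
report = check_node_memory(limit_gib=28, heap_gib=24, max_memory_per_node_gib=12)
print(report)
```

Running the same check with a 16G heap and a 12GB per-node limit reports that the pair does not fit, which is the kind of mismatch worth catching before a rollout.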

4. Injecting Catalogs and Secrets

Catalogs go into ConfigMaps (the additionalCatalogs above); passwords are kept separate in Secrets.

# Inject the Secret into workers/coordinator as environment variables (values.yaml)
worker:
  envFrom:
    - secretRef:
        name: trino-secrets
coordinator:
  envFrom:
    - secretRef:
        name: trino-secrets

# Create the Secret with kubectl
kubectl create secret generic trino-secrets -n trino \
  --from-literal=PG_PASSWORD='***' \
  --from-literal=LDAP_BIND_PASSWORD='***'

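Referencing ${ENV:PG_PASSWORD} inside the catalog properties keeps the password itself out of the values.yaml you commit to Git and out of the rendered ConfigMap; Trino resolves the reference from the Pod's environment at startup.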
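To make the substitution concrete, here is a small Python sketch that mimics how an ${ENV:NAME} reference is resolved. It is not Trino's implementation; the function name and the sample password are made up for illustration.

```python
import re

# Matches references such as ${ENV:PG_PASSWORD}
_ENV_REF = re.compile(r"\$\{ENV:([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_refs(properties_text, env):
    """Replace each ${ENV:NAME} with env[NAME]; raise KeyError if NAME is unset."""
    return _ENV_REF.sub(lambda m: env[m.group(1)], properties_text)


catalog = "connection-user=trino\nconnection-password=${ENV:PG_PASSWORD}\n"
rendered = expand_env_refs(catalog, {"PG_PASSWORD": "s3cret"})
print(rendered)
```

The sketch raises an error for an unset variable instead of substituting an empty string, so a missing Secret key shows up as a clear failure.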

5. Autoscaling — Matching Workers to Load

Workers are close to stateless, so horizontal scaling comes naturally.

5.1 HPA (CPU-based)

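# values.yaml (in the official chart, autoscaling is configured under server)
server:
  workers: 3                # the chart uses this as the HPA minimum
  autoscaling:
    enabled: true
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70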

When average CPU utilization across the worker Pods, measured against their CPU requests, exceeds 70%, the HPA adds worker Pods. Simple, but limited: Trino queries are highly bursty, so CPU can spike and settle before scaling has a chance to catch up.
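The replica count the HPA aims for follows the formula in the Kubernetes documentation: ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). A minimal Python sketch of that rule, with a hypothetical function name and without the stabilization windows a real HPA applies:

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                         min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for a single CPU metric."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 workers averaging 95% CPU against a 70% target
print(hpa_desired_replicas(4, 95, 70, min_replicas=3, max_replicas=20))
```

The tolerance band is one reason short CPU bursts may pass without triggering any scale-out.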

5.2 KEDA (Query/Schedule-based)

KEDA enables smarter scaling. You can scale on query-level metrics such as the number of queued queries, or configure schedule-based scaling that guarantees extra workers during business hours.

# KEDA ScaledObject example (conceptual)
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      query: trino_queued_queries        # number of queued queries; the exact metric name depends on your exporter config
      threshold: "5"
  - type: cron                            # guaranteed minimum during business hours
    metadata:
      timezone: Etc/UTC                   # required by the cron scaler; set your own zone
      start: "0 8 * * 1-5"
      end:   "0 20 * * 1-5"
      desiredReplicas: "8"
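How the two triggers above combine can be sketched in Python. This is a simplified model, not KEDA's code: it assumes the Prometheus trigger proposes ceil(metric / threshold) replicas, that the cron trigger contributes nothing outside its window, and that the largest proposal wins before clamping. The function and parameter names are hypothetical.

```python
import math


def keda_desired_replicas(queued_queries, threshold, in_cron_window,
                          cron_replicas, min_replicas, max_replicas):
    """Combine a queue-length trigger and a cron trigger (simplified model)."""
    from_queue = math.ceil(queued_queries / threshold)
    from_cron = cron_replicas if in_cron_window else 0
    desired = max(from_queue, from_cron)   # the largest proposal wins
    return max(min_replicas, min(max_replicas, desired))


# 42 queued queries at night: the queue drives scaling
print(keda_desired_replicas(42, 5, False, 8, min_replicas=3, max_replicas=20))
# 10 queued queries during business hours: the cron floor of 8 applies
print(keda_desired_replicas(10, 5, True, 8, min_replicas=3, max_replicas=20))
```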

Scaling out is easy; scaling in is the dangerous part. If a worker being removed is still executing a query, that query fails. This is why graceful shutdown is essential.

6. Graceful Shutdown — Removing Workers Without Killing Queries

Trino workers support graceful shutdown. When shutdown is requested through the worker's /v1/info/state API, the worker enters the SHUTTING_DOWN state, stops accepting new tasks, and waits for in-flight work to finish before exiting.

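The Helm chart wires this up for worker Pods with a preStop hook that requests the shutdown, with terminationGracePeriodSeconds acting as the overall deadline. The key is to give a generous grace period; the chart documentation asks for terminationGracePeriodSeconds to be at least twice gracePeriodSeconds.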

worker:
  # Time for in-flight queries to finish. Set it longer than your workload's longest query
  terminationGracePeriodSeconds: 300
  gracefulShutdown:
    enabled: true
    gracePeriodSeconds: 120

The sequence of events:

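1. K8s marks the Pod as terminating and runs the preStop hook (SIGTERM is sent only after the hook returns)
2. The hook asks the worker to transition to SHUTTING_DOWN → coordinator stops sending new tasks
3. Wait for in-flight tasks to complete
4. Terminate within terminationGracePeriodSeconds, which also counts the time spent in the preStop hook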

If the grace period is shorter than your query duration, K8s force-kills the Pod with SIGKILL and the query breaks. If you run long ETL jobs, set a generous grace period — or combine it with FTE from section 7.
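The shutdown request and the time budget can be sketched in Python. Per the Trino documentation, a worker that receives the request sleeps for the shutdown grace period, waits for active tasks, then sleeps for the grace period again before exiting, so two grace periods come out of the Pod's deadline. The URL, port, and user name below are assumptions, and the request is only built, not sent.

```python
import json
import urllib.request


def build_shutdown_request(worker_url, user="admin"):
    """Build (but do not send) the PUT that asks a worker to shut down gracefully."""
    return urllib.request.Request(
        url=worker_url.rstrip("/") + "/v1/info/state",
        data=json.dumps("SHUTTING_DOWN").encode("utf-8"),  # JSON string body
        method="PUT",
        headers={"Content-Type": "application/json", "X-Trino-User": user},
    )


def task_budget_seconds(termination_grace_period, shutdown_grace_period):
    """Seconds left for in-flight tasks after the two grace-period sleeps."""
    return termination_grace_period - 2 * shutdown_grace_period


req = build_shutdown_request("http://localhost:8080")
budget = task_budget_seconds(300, 120)
print(req.get_method(), req.full_url, budget)
```

With the values shown in this section (300 and 120), roughly 60 seconds remain for in-flight tasks, so workloads with longer tasks likely need a larger terminationGracePeriodSeconds.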

7. Fault-tolerant Execution — Running Confidently on Spot Instances

Graceful shutdown handles "announced" terminations, but it cannot protect against sudden worker loss like a spot instance reclaim. That's where Fault-tolerant Execution (FTE) comes in. Intermediate results are spooled to external storage (the exchange manager), and when a worker dies, only the affected tasks are retried on other workers.

# values.yaml
additionalConfigProperties:
  - retry-policy=TASK        # or QUERY
server:
  exchangeManager:
    name: filesystem
    baseDir: "s3://trino-exchange/spool"

# Generated as exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool

retryPolicy   Behavior                    Best for
QUERY         Retries the entire query    Short interactive queries
TASK          Retries only failed tasks   Long-running ETL batches

With FTE enabled you can run workers 100% on spot while keeping batch jobs reliable — a substantial cost saving. The trade-off is spool I/O overhead, so it's better left off for clusters dedicated to ultra-low-latency interactive queries.
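The difference between the two retry policies can be illustrated with a toy cost model in Python. It counts only the task-seconds that must be re-executed after a worker loss and ignores spool I/O, scheduling, and partial progress; the task names and durations are invented for the example.

```python
def work_redone(task_seconds, failed_task_ids, retry_policy):
    """Task-seconds re-executed after a worker loss, under a toy cost model."""
    if retry_policy == "QUERY":
        # The whole query restarts, so every task runs again
        return sum(task_seconds.values())
    if retry_policy == "TASK":
        # Only tasks that ran on the lost worker are rescheduled
        return sum(task_seconds[t] for t in failed_task_ids)
    raise ValueError(f"unknown retry policy: {retry_policy}")


tasks = {"scan-0": 600, "scan-1": 600, "join-0": 900, "agg-0": 300}
lost = ["join-0"]
print(work_redone(tasks, lost, "QUERY"), work_redone(tasks, lost, "TASK"))
```

In this model the gap grows with query length, which matches the guidance to reserve TASK for long-running batches.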

8. Scheduling — Node Placement and Stability

# Coordinator on stable on-demand nodes, workers on spot
coordinator:
  nodeSelector:
    node-pool: on-demand
worker:
  nodeSelector:
    node-pool: spot
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
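  # Spread workers across nodes to reduce the risk of losing many at once
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      # Without a labelSelector the constraint matches no Pods. These labels are
      # an assumption; confirm them with: kubectl get pods -n trino --show-labels
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: worker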

Principles:

  • Never put the coordinator on spot. If the coordinator dies, the entire cluster stops (it's a SPOF). Guarantee it an on-demand node with sufficient resources.
  • Workers on spot + FTE is the cost-efficient combination.
  • Use topologySpreadConstraints to spread workers across nodes/zones, mitigating scenarios where many disappear simultaneously.

9. Observability — Prometheus and Health Checks

Trino exposes JMX metrics. Collect them into Prometheus via a JMX exporter sidecar or the built-in metrics endpoint.

serviceMonitor:        # When using the Prometheus Operator
  enabled: true
  labels:
    release: prometheus

Key metrics to monitor:

Metric                                         Meaning
trino_running_queries / trino_queued_queries   Running/queued query counts (scaling trigger)
Cluster memory utilization                     OOM risk
Worker node count                              Discovery health
Failed query rate                              Stability
GC pause                                       JVM heap pressure

Use the coordinator's /v1/info endpoint for readiness, and configure the readiness probe so no traffic is sent to nodes still in the starting state.
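The readiness decision can be sketched in Python. The /v1/info response is a JSON object that includes a boolean "starting" field; the function name below is hypothetical, and the sketch treats anything unparseable as not ready.

```python
import json


def is_ready(info_body):
    """True when a /v1/info response body reports that startup has finished."""
    try:
        info = json.loads(info_body)
    except (TypeError, ValueError):
        return False                       # missing or malformed body: not ready
    return isinstance(info, dict) and info.get("starting") is False


print(is_ready('{"coordinator": true, "starting": false}'))
print(is_ready('{"coordinator": true, "starting": true}'))
```

Requiring an explicit false, instead of accepting a missing field, keeps a node out of rotation when the response is not what the probe expects.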

10. Deployment Checklist

  • Pin the image tag (never latest)
  • Coordinator on on-demand, workers on a separate spot node pool
  • JVM heap < container limit (20–30% headroom)
  • Catalogs in ConfigMaps, secrets in Secrets
  • HPA/KEDA autoscaling configured
  • Graceful shutdown grace period ≥ longest query duration
  • Long-running batches use FTE (retryPolicy=TASK) + exchange manager
  • Shared secret for internal communication + TLS
  • Prometheus metrics and readiness probes configured
  • PodDisruptionBudget to limit concurrent scale-down

11. Summary

Operational challenge          Solution on Kubernetes
Load fluctuation               HPA (CPU) or KEDA (queries/schedule)
Announced worker termination   Graceful shutdown + a generous grace period
Sudden worker loss (spot)      FTE (retryPolicy=TASK) + exchange manager
Coordinator SPOF               On-demand node + guaranteed resources
Secret management              Separate ConfigMap/Secret, env-var substitution
Many workers lost at once      topologySpread + PodDisruptionBudget

Trino's stateless architecture pairs well with Kubernetes, but the key to production stability is handling the reality that "a worker can disappear in the middle of a distributed query" with two mechanisms: graceful shutdown and FTE. Add the spot instance + FTE combination on top, and you can dramatically cut compute costs without sacrificing reliability.


This article is based on the official Trino Helm chart and the Trino 440-series releases. If you need help with Trino deployment, autoscaling, or cost optimization on Kubernetes, feel free to reach out.

— The Data Dynamics Engineering Team