Blog
trinofault-tolerant-executionetlexchange-managerdata-platform

Trino Fault-tolerant Execution Deep Dive — Exchange Manager and Retry Policies

A deep look at Trino's FTE, which keeps queries alive even when workers die in the middle of long-running ETL batches. QUERY vs TASK retry policies, exchange manager (spool storage) configuration, running on spot instances, and why you should not enable it for interactive queries.

Data DynamicsJune 5, 20267 min read

Trino's default execution model is "all-or-nothing." If even one task in a query fails, the entire query fails. For short interactive queries that's no big deal — you just resubmit. But it's a different story when a multi-hour ETL batch gets wiped out in its final stage because a single worker died.

Fault-tolerant Execution (FTE) solves this problem. Intermediate results are spooled to external storage, and only the failed units are retried on other workers. This post covers how FTE works internally, how to configure it, and when to turn it on — and off.

1. Why FTE Is Needed — Limits of the Default Execution Model

In normal mode, Trino streams data between stages directly through memory and the network (streaming exchange). It's fast, but the intermediate data is not persisted anywhere.

[Normal mode] Stage1 ──in-memory streaming──> Stage2 ──> Stage3
              If even one worker dies → entire query FAILS → start over from scratch

The larger the data and the longer the execution, the higher the probability that a worker dies mid-flight. With spot instances, workers disappear regularly due to reclaims, making this probability even higher. Rerunning a long batch from the beginning every time wastes both money and time.

2. The Core Idea of FTE — Spooling Intermediate Results

Instead of streaming data between stages directly, FTE spools it to external storage (S3/GCS/HDFS, etc.) via the exchange manager. Because the intermediate results are preserved, execution can resume from that point even if a worker dies.

[FTE mode] Stage1 ──> [Exchange: S3 spool] ──> Stage2 ──> [Exchange: S3 spool] ──> Stage3
           If a worker dies → only the failed part is retried from the spooled intermediate results
Normal (streaming)FTE (spooling)
Data between stagesPassed directly via memory/networkSpooled to external storage by the exchange manager
On worker lossEntire query failsOnly the failed unit is retried
LatencyLowIncreased by spool I/O
Best forInteractive queriesLong-running ETL batches

3. Two Retry Policies — QUERY vs TASK

FTE determines the retry granularity via retry-policy.

# etc/config.properties
retry-policy=TASK
PolicyRetry unitBehaviorBest for
QUERYEntire queryReruns the whole query on failureInteractive clusters with many short queries
TASKIndividual taskRetries only the failed task on another workerLong-running, large-scale ETL batches

QUERY Policy

When a query fails, the coordinator automatically reruns the whole thing. The user never sees the failure — they just get the result. It suits short queries, and it can work without an exchange manager (small results fit in coordinator memory), but an exchange manager is recommended for large queries.

TASK Policy

Retries happen at the granularity of the individual tasks that make up the query. An exchange manager is mandatory. In a long-running batch, if one worker dies, only the tasks it was running are handed off to surviving workers and re-executed — the rest of the query is not rerun. TASK mode also enables adaptive task scheduling, so it responds flexibly to changes in worker count (autoscaling, spot).

Rule of thumb: interactive clusters use QUERY, ETL/batch clusters use TASK. If you want to mix both on a single cluster, you can switch at the session level — SET SESSION retry_policy = 'TASK'.

4. Configuring the Exchange Manager

The TASK policy (and large QUERY workloads) requires an exchange manager. You specify the spool storage location.

# etc/exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool
exchange.encryption-enabled=true

S3 credentials and the endpoint follow the filesystem configuration.

# S3 access settings in the same file or config
exchange.s3.region=ap-northeast-2
exchange.s3.endpoint=https://s3.ap-northeast-2.amazonaws.com

Choosing Spool Storage

StorageAdvantagesConsiderations
S3 / GCS / Azure BlobUnlimited capacity, decoupled from the clusterNetwork latency, request costs
HDFSLeverages existing on-premises assetsNameNode load
Local disk (not shared across workers)FastSpool is lost when the worker is lost → FTE is pointless

The key point: the spool storage must have a lifecycle independent of the workers. The spool has to survive a worker's death for retries to be possible, so local disk is unsuitable and object storage is the standard choice.

Spreading Across Multiple Locations

In large clusters, spool I/O can become a bottleneck, so spread it across multiple directories/buckets.

exchange.base-directories=s3://trino-exchange-1/spool,s3://trino-exchange-2/spool

These options fine-tune FTE's behavior.

ParameterRole
fault-tolerant-execution-task-memoryEstimated memory per task in TASK mode (scheduling unit)
fault-tolerant-execution-target-task-input-sizeTarget task input size (task split granularity)
fault-tolerant-execution-target-task-split-countTarget number of splits per task
fault-tolerant-execution-max-task-split-countUpper bound on splits per task
exchange.compression-enabledCompress spool data to cut I/O and cost
query.low-memory-killer.policyPolicy for choosing what to kill under memory pressure

Compression is almost always worth enabling (it cuts network and storage costs). For task input size, too small adds overhead while too large raises retry costs — tune it to your workload.

6. FTE and Spot Instances — The Key to Cost Savings

FTE's most powerful use case is running workers on spot instances.

100% on-demand workers     →  stable but expensive
Spot workers + FTE(TASK)   →  task retries carry the job through reclaims, cutting costs by up to 70-90%

Even when a worker vanishes due to a spot reclaim, the surviving workers pick up its tasks thanks to the spooled intermediate results. On Kubernetes, the standard pattern is an on-demand coordinator with spot workers plus FTE. (Kubernetes deployment is covered in a separate post, "Deploying Trino on Kubernetes.")

7. Trade-offs — What You Give Up by Enabling FTE

FTE is not free.

  • Higher latency: every stage boundary writes to and reads back from external storage, so short queries actually get slower. It is unsuitable for millisecond-to-second interactive queries.
  • Storage cost/requests: you pay for S3 PUT/GET requests and stored capacity. Mitigate with compression.
  • Operational complexity: you must manage the exchange manager, spool buckets, permissions, and cleanup (lifecycle) policies.
WorkloadRecommendation
BI dashboards, interactive analyticsFTE off (streaming)
Many short queriesQUERY (optional)
Long-running ETL, large joins/aggregationsTASK + exchange manager
Spot worker clustersTASK + exchange manager (close to mandatory)

8. Verification — Confirming FTE Actually Works

-- Check the current session's retry policy
SHOW SESSION LIKE 'retry_policy';
 
-- Switch at the session level (FTE for just one heavy batch on an interactive cluster)
SET SESSION retry_policy = 'TASK';

After running, you can check the coordinator Web UI's Query Detail page for retried tasks and confirm that the exchange operated in spooling mode. We recommend a spot-reclaim test: deliberately kill a worker and verify the query runs to completion.

9. Cleaning Up Spool Storage (Lifecycle)

Trino cleans up spool data when a query finishes, but abnormal terminations and interruptions can leave orphaned objects behind. Attach a lifecycle policy to the bucket (e.g., automatically delete objects under the prefix spool/ after 1-3 days) to plug the cost leak.

S3 Lifecycle Rule: prefix "spool/", expire after 2 days

10. Summary

ItemNormal modeFTE mode
Failure handlingEntire query failsTask/query-level retry
Intermediate dataIn-memory streamingSpooled via exchange manager
LatencyLowIncreased by spool I/O
Spot operationRiskySafe (major cost savings)
Suitable workloadInteractiveLong-running batches

FTE is a feature where you "pay a bit of latency for resilience." Leave it off for interactive clusters; for long-running ETL and spot-based batch clusters, configure the TASK policy with an object-storage exchange manager. Enable spool compression and attach a lifecycle policy to the bucket, and you get both resilience and cost control.


This article is based on the Trino 440 line. If you need help stabilizing large-scale batches or optimizing costs with spot instances, feel free to reach out.

— Data Dynamics Engineering Team