Trino Fault-tolerant Execution Deep Dive — Exchange Manager and Retry Policies
A deep look at Trino's FTE, which keeps queries alive even when workers die in the middle of long-running ETL batches. QUERY vs TASK retry policies, exchange manager (spool storage) configuration, running on spot instances, and why you should not enable it for interactive queries.
Trino's default execution model is "all-or-nothing." If even one task in a query fails, the entire query fails. For short interactive queries that's no big deal — you just resubmit. But it's a different story when a multi-hour ETL batch gets wiped out in its final stage because a single worker died.
Fault-tolerant Execution (FTE) solves this problem. Intermediate results are spooled to external storage, and only the failed units are retried on other workers. This post covers how FTE works internally, how to configure it, and when to turn it on — and off.
1. Why FTE Is Needed — Limits of the Default Execution Model
In normal mode, Trino streams data between stages directly through memory and the network (streaming exchange). It's fast, but the intermediate data is not persisted anywhere.
[Normal mode] Stage1 ──in-memory streaming──> Stage2 ──> Stage3
If even one worker dies → entire query FAILS → start over from scratchThe larger the data and the longer the execution, the higher the probability that a worker dies mid-flight. With spot instances, workers disappear regularly due to reclaims, making this probability even higher. Rerunning a long batch from the beginning every time wastes both money and time.
2. The Core Idea of FTE — Spooling Intermediate Results
Instead of streaming data between stages directly, FTE spools it to external storage (S3/GCS/HDFS, etc.) via the exchange manager. Because the intermediate results are preserved, execution can resume from that point even if a worker dies.
[FTE mode] Stage1 ──> [Exchange: S3 spool] ──> Stage2 ──> [Exchange: S3 spool] ──> Stage3
If a worker dies → only the failed part is retried from the spooled intermediate results| Normal (streaming) | FTE (spooling) | |
|---|---|---|
| Data between stages | Passed directly via memory/network | Spooled to external storage by the exchange manager |
| On worker loss | Entire query fails | Only the failed unit is retried |
| Latency | Low | Increased by spool I/O |
| Best for | Interactive queries | Long-running ETL batches |
3. Two Retry Policies — QUERY vs TASK
FTE determines the retry granularity via retry-policy.
# etc/config.properties
retry-policy=TASK| Policy | Retry unit | Behavior | Best for |
|---|---|---|---|
QUERY | Entire query | Reruns the whole query on failure | Interactive clusters with many short queries |
TASK | Individual task | Retries only the failed task on another worker | Long-running, large-scale ETL batches |
QUERY Policy
When a query fails, the coordinator automatically reruns the whole thing. The user never sees the failure — they just get the result. It suits short queries, and it can work without an exchange manager (small results fit in coordinator memory), but an exchange manager is recommended for large queries.
TASK Policy
Retries happen at the granularity of the individual tasks that make up the query. An exchange manager is mandatory. In a long-running batch, if one worker dies, only the tasks it was running are handed off to surviving workers and re-executed — the rest of the query is not rerun. TASK mode also enables adaptive task scheduling, so it responds flexibly to changes in worker count (autoscaling, spot).
Rule of thumb: interactive clusters use
QUERY, ETL/batch clusters useTASK. If you want to mix both on a single cluster, you can switch at the session level —SET SESSION retry_policy = 'TASK'.
4. Configuring the Exchange Manager
The TASK policy (and large QUERY workloads) requires an exchange manager. You specify the spool storage location.
# etc/exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool
exchange.encryption-enabled=trueS3 credentials and the endpoint follow the filesystem configuration.
# S3 access settings in the same file or config
exchange.s3.region=ap-northeast-2
exchange.s3.endpoint=https://s3.ap-northeast-2.amazonaws.comChoosing Spool Storage
| Storage | Advantages | Considerations |
|---|---|---|
| S3 / GCS / Azure Blob | Unlimited capacity, decoupled from the cluster | Network latency, request costs |
| HDFS | Leverages existing on-premises assets | NameNode load |
| Local disk (not shared across workers) | Fast | Spool is lost when the worker is lost → FTE is pointless |
The key point: the spool storage must have a lifecycle independent of the workers. The spool has to survive a worker's death for retries to be possible, so local disk is unsuitable and object storage is the standard choice.
Spreading Across Multiple Locations
In large clusters, spool I/O can become a bottleneck, so spread it across multiple directories/buckets.
exchange.base-directories=s3://trino-exchange-1/spool,s3://trino-exchange-2/spool5. Related Tuning Parameters
These options fine-tune FTE's behavior.
| Parameter | Role |
|---|---|
fault-tolerant-execution-task-memory | Estimated memory per task in TASK mode (scheduling unit) |
fault-tolerant-execution-target-task-input-size | Target task input size (task split granularity) |
fault-tolerant-execution-target-task-split-count | Target number of splits per task |
fault-tolerant-execution-max-task-split-count | Upper bound on splits per task |
exchange.compression-enabled | Compress spool data to cut I/O and cost |
query.low-memory-killer.policy | Policy for choosing what to kill under memory pressure |
Compression is almost always worth enabling (it cuts network and storage costs). For task input size, too small adds overhead while too large raises retry costs — tune it to your workload.
6. FTE and Spot Instances — The Key to Cost Savings
FTE's most powerful use case is running workers on spot instances.
100% on-demand workers → stable but expensive
Spot workers + FTE(TASK) → task retries carry the job through reclaims, cutting costs by up to 70-90%Even when a worker vanishes due to a spot reclaim, the surviving workers pick up its tasks thanks to the spooled intermediate results. On Kubernetes, the standard pattern is an on-demand coordinator with spot workers plus FTE. (Kubernetes deployment is covered in a separate post, "Deploying Trino on Kubernetes.")
7. Trade-offs — What You Give Up by Enabling FTE
FTE is not free.
- Higher latency: every stage boundary writes to and reads back from external storage, so short queries actually get slower. It is unsuitable for millisecond-to-second interactive queries.
- Storage cost/requests: you pay for S3 PUT/GET requests and stored capacity. Mitigate with compression.
- Operational complexity: you must manage the exchange manager, spool buckets, permissions, and cleanup (lifecycle) policies.
| Workload | Recommendation |
|---|---|
| BI dashboards, interactive analytics | FTE off (streaming) |
| Many short queries | QUERY (optional) |
| Long-running ETL, large joins/aggregations | TASK + exchange manager |
| Spot worker clusters | TASK + exchange manager (close to mandatory) |
8. Verification — Confirming FTE Actually Works
-- Check the current session's retry policy
SHOW SESSION LIKE 'retry_policy';
-- Switch at the session level (FTE for just one heavy batch on an interactive cluster)
SET SESSION retry_policy = 'TASK';After running, you can check the coordinator Web UI's Query Detail page for retried tasks and confirm that the exchange operated in spooling mode. We recommend a spot-reclaim test: deliberately kill a worker and verify the query runs to completion.
9. Cleaning Up Spool Storage (Lifecycle)
Trino cleans up spool data when a query finishes, but abnormal terminations and interruptions can leave orphaned objects behind. Attach a lifecycle policy to the bucket (e.g., automatically delete objects under the prefix spool/ after 1-3 days) to plug the cost leak.
S3 Lifecycle Rule: prefix "spool/", expire after 2 days10. Summary
| Item | Normal mode | FTE mode |
|---|---|---|
| Failure handling | Entire query fails | Task/query-level retry |
| Intermediate data | In-memory streaming | Spooled via exchange manager |
| Latency | Low | Increased by spool I/O |
| Spot operation | Risky | Safe (major cost savings) |
| Suitable workload | Interactive | Long-running batches |
FTE is a feature where you "pay a bit of latency for resilience." Leave it off for interactive clusters; for long-running ETL and spot-based batch clusters, configure the TASK policy with an object-storage exchange manager. Enable spool compression and attach a lifecycle policy to the bucket, and you get both resilience and cost control.
This article is based on the Trino 440 line. If you need help stabilizing large-scale batches or optimizing costs with spot instances, feel free to reach out.
— Data Dynamics Engineering Team