Tags: trino, fault-tolerant-execution, etl, exchange-manager, data-platform

Trino Fault-tolerant Execution Deep Dive — Exchange Manager and Retry Policies

A deep look at Trino's fault-tolerant execution (FTE), which keeps queries alive when workers die in the middle of long-running ETL batches. It covers the QUERY and TASK retry policies, exchange manager (spool storage) configuration, running on spot instances, and why you should not enable FTE for interactive queries.

Data Dynamics | June 5, 2026 | 8 min read

Trino's default execution model is "all-or-nothing." If even one task in a query fails, the entire query fails. For short interactive queries that's no big deal — you just resubmit. But it's a different story when a multi-hour ETL batch gets wiped out in its final stage because a single worker died.

Fault-tolerant Execution (FTE) solves this problem. Intermediate results are spooled to external storage, and only the failed units are retried on other workers. This post covers how FTE works internally, how to configure it, and when to turn it on — and off.

1. Why FTE Is Needed — Limits of the Default Execution Model

In normal mode, Trino streams data between stages directly through memory and the network (streaming exchange). It's fast, but the intermediate data is not persisted anywhere.

The larger the data and the longer the execution, the higher the probability that a worker dies mid-flight. With spot instances, workers disappear regularly due to reclaims, making this probability even higher. Rerunning a long batch from the beginning every time wastes both money and time.
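
To put rough numbers on that claim, here is a minimal sketch that estimates the chance of losing at least one worker while a query runs. It assumes workers fail independently at a constant hourly rate; the function name, the 50-worker cluster, and the 2% rate are illustrative assumptions, not measurements.

```python
def p_any_worker_lost(workers: int, hours: float, hourly_loss_rate: float) -> float:
    """Probability that at least one worker is lost while the query runs.

    Assumes independent workers and a constant per-worker hourly loss rate.
    Both are simplifications made for this illustration.
    """
    survive_one = (1.0 - hourly_loss_rate) ** hours  # one worker survives the whole run
    return 1.0 - survive_one ** workers               # complement of "all workers survive"


if __name__ == "__main__":
    # Hypothetical cluster: 50 workers, 2% chance of losing a given worker per hour.
    for hours in (0.05, 1, 4):
        p = p_any_worker_lost(workers=50, hours=hours, hourly_loss_rate=0.02)
        print(f"{hours:>5} h query: P(at least one worker lost) = {p:.1%}")
```

Under these assumptions a query lasting a few minutes almost never sees a loss, while a multi-hour batch almost always does, which is why rerunning from scratch scales so badly.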

2. The Core Idea of FTE — Spooling Intermediate Results

Instead of streaming data between stages directly, FTE spools it to external storage (S3/GCS/HDFS, etc.) via the exchange manager. Because the intermediate results are preserved, execution can resume from that point even if a worker dies.

Item	Normal (streaming)	FTE (spooling)
Data between stages	Passed directly via memory/network	Spooled to external storage by the exchange manager
On worker loss	Entire query fails	Only the failed unit is retried
Latency	Low	Increased by spool I/O
Best for	Interactive queries	Long-running ETL batches

3. Two Retry Policies — QUERY vs TASK

FTE determines the retry granularity via retry-policy.

# etc/config.properties
retry-policy=TASK

Policy	Retry unit	Behavior	Best for
`QUERY`	Entire query	Reruns the whole query on failure	Interactive clusters with many short queries
`TASK`	Individual task	Retries only the failed task on another worker	Long-running, large-scale ETL batches

QUERY Policy

When a query fails, the coordinator automatically reruns the whole query, up to a configured number of attempts. As long as a retry succeeds, the user never sees the failure and simply gets the result. This suits short queries. It can run without an exchange manager when the result set is small enough for the coordinator to buffer in memory (32 MB by default), but an exchange manager is recommended for queries that return larger results.

TASK Policy

Retries happen at the granularity of the individual tasks that make up the query. An exchange manager is mandatory. In a long-running batch, if one worker dies, only the tasks it was running are handed off to surviving workers and re-executed — the rest of the query is not rerun. TASK mode also enables adaptive task scheduling, so it responds flexibly to changes in worker count (autoscaling, spot).

Rule of thumb: interactive clusters use QUERY, ETL/batch clusters use TASK. To mix both on a single cluster, switch at the session level with SET SESSION retry_policy = 'TASK'. The cluster must already have an exchange manager configured for TASK to work.
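
The difference in retry granularity can be illustrated with a toy model. This is a sketch, not Trino's scheduler: the `Task` class, the `lost_work` function, and the numbers are hypothetical, and it assumes that a finished task's output is already safe in the spool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    worker: str
    spent_minutes: float  # work already invested in this task
    finished: bool        # finished tasks have committed their output to the spool


def lost_work(tasks: list[Task], lost_worker: str, policy: str) -> float:
    """Task-minutes that must be redone after one worker disappears (toy model)."""
    if policy == "QUERY":
        # The whole query restarts, so everything done so far is thrown away.
        return sum(t.spent_minutes for t in tasks)
    if policy == "TASK":
        # Only unfinished tasks on the lost worker are rerun elsewhere.
        return sum(
            t.spent_minutes
            for t in tasks
            if t.worker == lost_worker and not t.finished
        )
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical snapshot of a query at the moment worker "w2" is lost.
tasks = [
    Task("w1", 30, True), Task("w1", 10, False),
    Task("w2", 30, True), Task("w2", 12, False),
    Task("w3", 30, True), Task("w3", 8, False),
]

if __name__ == "__main__":
    for policy in ("QUERY", "TASK"):
        print(policy, lost_work(tasks, lost_worker="w2", policy=policy))
```

In this snapshot QUERY discards all 120 task-minutes, while TASK discards only the 12 minutes of the one unfinished task on the lost worker.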

4. Configuring the Exchange Manager

The TASK policy (and large QUERY workloads) requires an exchange manager. You specify the spool storage location.

# etc/exchange-manager.properties
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange/spool
exchange.encryption-enabled=true

The S3 region, endpoint, and credentials for the spool bucket are set with exchange.s3.* properties in the same file.

# etc/exchange-manager.properties (continued): S3 access settings
exchange.s3.region=ap-northeast-2
exchange.s3.endpoint=https://s3.ap-northeast-2.amazonaws.com

Choosing Spool Storage

Storage	Advantages	Considerations
S3 / GCS / Azure Blob	Unlimited capacity, decoupled from the cluster	Network latency, request costs
HDFS	Leverages existing on-premises assets	NameNode load
Local disk (not shared across workers)	Fast	Spool is lost when the worker is lost → FTE is pointless

The key point: the spool storage must have a lifecycle independent of the workers. The spool has to survive a worker's death for retries to be possible, so local disk is unsuitable and object storage is the standard choice.
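
A small pre-deployment check can catch a spool location that would die with its worker. The sketch below is a hypothetical helper, not part of Trino: it flags any entry of `exchange.base-directories` whose URI scheme is not in an assumed list of shared storage schemes. Adjust that list to whatever your deployment actually supports.

```python
from urllib.parse import urlparse

# Schemes treated as "outlives the worker" in this sketch (an assumption, not a Trino list).
SHARED_SCHEMES = {"s3", "gs", "abfs", "abfss", "hdfs"}


def non_shared_directories(base_directories: str) -> list[str]:
    """Return the comma-separated spool locations that look worker-local."""
    suspicious = []
    for raw in base_directories.split(","):
        location = raw.strip()
        if not location:
            continue  # ignore empty entries such as a trailing comma
        if urlparse(location).scheme.lower() not in SHARED_SCHEMES:
            suspicious.append(location)  # plain paths and file:// URIs end up here
    return suspicious


if __name__ == "__main__":
    print(non_shared_directories("s3://trino-exchange/spool"))
    print(non_shared_directories("s3://trino-exchange/spool, /mnt/local/spool"))
```

A check like this fits naturally into CI for the cluster configuration, so a local path never reaches a production exchange-manager.properties.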

Spreading Across Multiple Locations

In large clusters, spool I/O can become a bottleneck, so spread it across multiple directories/buckets.

exchange.base-directories=s3://trino-exchange-1/spool,s3://trino-exchange-2/spool

5. Related Tuning Parameters

These options fine-tune FTE's behavior.

Parameter	Role
`fault-tolerant-execution-task-memory`	Estimated memory per task in TASK mode (scheduling unit)
`fault-tolerant-execution-target-task-input-size`	Target task input size (task split granularity)
`fault-tolerant-execution-target-task-split-count`	Target number of splits per task
`fault-tolerant-execution-max-task-split-count`	Upper bound on splits per task
`exchange.compression-enabled`	Compress spool data to cut I/O and cost
`query.low-memory-killer.policy`	Policy for choosing what to kill under memory pressure

Compression is almost always worth enabling (it cuts network and storage costs). For task input size, too small adds overhead while too large raises retry costs — tune it to your workload.
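
The task-size trade-off is easy to see with back-of-the-envelope arithmetic. The sketch below is a simplified model with made-up numbers: the fixed per-task overhead, the scan throughput, and the function names are all assumptions for illustration, and Trino's real scheduler considers more than input size.

```python
import math


def task_count(stage_input_gb: float, target_task_input_gb: float) -> int:
    """Number of tasks needed if each task reads about the target input size."""
    return max(1, math.ceil(stage_input_gb / target_task_input_gb))


def sizing_tradeoff(
    stage_input_gb: float,
    target_task_input_gb: float,
    per_task_overhead_s: float,
    scan_gb_per_s: float,
) -> tuple[int, float, float]:
    """Return (tasks, total scheduling overhead in s, cost of retrying one task in s)."""
    tasks = task_count(stage_input_gb, target_task_input_gb)
    total_overhead_s = tasks * per_task_overhead_s
    # Retrying one task repeats its fixed overhead plus re-reading its input.
    task_input_gb = min(target_task_input_gb, stage_input_gb)
    retry_cost_s = per_task_overhead_s + task_input_gb / scan_gb_per_s
    return tasks, total_overhead_s, retry_cost_s


if __name__ == "__main__":
    # Hypothetical stage: 1 TiB of input, 2 s fixed overhead per task, 0.1 GB/s per task.
    for target_gb in (0.5, 4, 32):
        tasks, overhead_s, retry_s = sizing_tradeoff(1024, target_gb, 2.0, 0.1)
        print(f"target {target_gb:>4} GB: {tasks:>5} tasks, "
              f"overhead {overhead_s:>6.0f} s, one retry {retry_s:>5.0f} s")
```

Small targets multiply the fixed overhead across thousands of tasks, while large targets make every single retry expensive. The sweet spot depends on how often workers are actually lost.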

6. FTE and Spot Instances — The Key to Cost Savings

FTE's most powerful use case is running workers on spot instances.

Even when a worker vanishes due to a spot reclaim, the surviving workers pick up its tasks thanks to the spooled intermediate results. On Kubernetes, the standard pattern is an on-demand coordinator with spot workers plus FTE. (Kubernetes deployment is covered in a separate post, "Deploying Trino on Kubernetes.")
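
Whether spot plus FTE pays off comes down to simple arithmetic: the spot discount has to outweigh the extra work caused by retries and spool I/O. The sketch below uses hypothetical prices and a hypothetical 10% overhead; substitute your own figures.

```python
def batch_cost(worker_hours: float, hourly_price: float, overhead_fraction: float = 0.0) -> float:
    """Cost of a batch, inflated by the fraction of extra work (retries, spool I/O)."""
    return worker_hours * (1.0 + overhead_fraction) * hourly_price


if __name__ == "__main__":
    # Hypothetical batch: 200 worker-hours; prices in USD per worker-hour are assumptions.
    on_demand = batch_cost(200, hourly_price=0.40)
    spot_fte = batch_cost(200, hourly_price=0.12, overhead_fraction=0.10)
    print(f"on-demand, no FTE: ${on_demand:.2f}")
    print(f"spot + FTE:        ${spot_fte:.2f}")
    print(f"savings:           {1 - spot_fte / on_demand:.0%}")
```

With these example inputs the spot cluster stays far cheaper even after paying for 10% extra work. The conclusion flips only if the overhead grows large or the spot discount is small.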

7. Trade-offs — What You Give Up by Enabling FTE

FTE is not free.

Higher latency: every stage boundary writes to and reads back from external storage, so short queries actually get slower. It is unsuitable for millisecond-to-second interactive queries.
Storage cost/requests: you pay for S3 PUT/GET requests and stored capacity. Mitigate with compression.
Operational complexity: you must manage the exchange manager, spool buckets, permissions, and cleanup (lifecycle) policies.

Workload	Recommendation
BI dashboards, interactive analytics	FTE off (streaming)
Many short queries	`QUERY` (optional)
Long-running ETL, large joins/aggregations	`TASK` + exchange manager
Spot worker clusters	`TASK` + exchange manager (close to mandatory)
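
The recommendations above follow from how spool overhead compares with total runtime. As a rough sketch, assume each stage boundary costs one write and one read of the exchanged data plus a fixed request latency. The function name and every number below are illustrative assumptions, not benchmarks.

```python
def spool_overhead_seconds(
    stage_boundaries: int,
    gb_per_boundary: float,
    throughput_gb_per_s: float,
    request_latency_s: float,
) -> float:
    """Extra seconds spent writing and reading back spool data (simplified model)."""
    one_way = request_latency_s + gb_per_boundary / throughput_gb_per_s
    return stage_boundaries * 2 * one_way  # one write plus one read per boundary


if __name__ == "__main__":
    # Hypothetical interactive query: 1 s baseline, 3 boundaries, 10 MB exchanged at each.
    interactive = spool_overhead_seconds(3, 0.01, 1.0, 0.05)
    # Hypothetical ETL batch: 3 h baseline, 6 boundaries, 50 GB exchanged at each.
    etl = spool_overhead_seconds(6, 50, 1.0, 0.05)
    print(f"interactive: +{interactive:.2f} s on 1 s ({interactive / 1.0:.0%})")
    print(f"ETL batch:   +{etl:.0f} s on 10800 s ({etl / 10800:.1%})")
```

In this model the same mechanism adds a large relative penalty to a one-second query and only a few percent to a three-hour batch.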

8. Verification — Confirming FTE Actually Works

-- Check the current session's retry policy
SHOW SESSION LIKE 'retry_policy';
 
-- Switch at the session level (FTE for just one heavy batch on an interactive cluster)
SET SESSION retry_policy = 'TASK';

After the query runs, open the Query Detail page in the coordinator Web UI to look for retried tasks and confirm that the exchange operated in spooling mode. We also recommend a spot-reclaim test: deliberately kill a worker mid-query and verify that the query still runs to completion.
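
The session check can also be scripted, for example as a guard at the start of a batch job. The helper below is a sketch that works with any Python DB-API cursor. The function name is hypothetical, and it assumes SHOW SESSION returns the property name in the first column and its current value in the second.

```python
def current_retry_policy(cursor) -> str:
    """Return the session's retry_policy value using a DB-API 2.0 cursor."""
    cursor.execute("SHOW SESSION LIKE 'retry_policy'")
    for row in cursor.fetchall():
        name, value = row[0], row[1]
        # "_" is a single-character wildcard in LIKE, so match the name exactly here.
        if name == "retry_policy":
            return value
    raise LookupError("retry_policy session property not found")


# Possible usage with the `trino` Python client (host, port, and user are placeholders):
#
#   import trino
#   conn = trino.dbapi.connect(host="coordinator.example.com", port=8080, user="etl")
#   cur = conn.cursor()
#   cur.execute("SET SESSION retry_policy = 'TASK'")
#   cur.fetchall()
#   assert current_retry_policy(cur) == "TASK"
```

Failing fast when the policy is not what the job expects is cheaper than discovering it after a worker loss kills a multi-hour run.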

9. Cleaning Up Spool Storage (Lifecycle)

Trino cleans up spool data when a query finishes, but abnormal terminations and interruptions can leave orphaned objects behind. Attach a lifecycle policy to the bucket (e.g., automatically delete objects under the prefix spool/ after 1-3 days) to plug the cost leak. Keep the expiry comfortably longer than your longest-running query so the rule never deletes spool data that is still in use.

S3 Lifecycle Rule: prefix "spool/", expire after 2 days
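
Written out as an S3 lifecycle configuration, the rule above could look like the following sketch. The rule ID and function name are hypothetical, and the extra clause that aborts incomplete multipart uploads is an addition of this example rather than something the rule above requires.

```python
import json


def spool_lifecycle_configuration(prefix: str = "spool/", expire_days: int = 2) -> dict:
    """Build an S3 lifecycle configuration that expires orphaned spool objects."""
    return {
        "Rules": [
            {
                "ID": "expire-orphaned-trino-spool",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": expire_days},
                # Assumption of this sketch: also clean up abandoned multipart uploads.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(spool_lifecycle_configuration(), indent=2))
    # One way to apply it, assuming boto3 and a bucket named "trino-exchange":
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="trino-exchange",
    #       LifecycleConfiguration=spool_lifecycle_configuration(),
    #   )
```

Keeping the rule in code next to the cluster configuration makes it harder for the prefix in exchange.base-directories and the prefix in the lifecycle rule to drift apart.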

10. Summary

Item	Normal mode	FTE mode
Failure handling	Entire query fails	Task/query-level retry
Intermediate data	In-memory streaming	Spooled via exchange manager
Latency	Low	Increased by spool I/O
Spot operation	Risky	Safe (major cost savings)
Suitable workload	Interactive	Long-running batches

FTE is a feature where you "pay a bit of latency for resilience." Leave it off for interactive clusters; for long-running ETL and spot-based batch clusters, configure the TASK policy with an object-storage exchange manager. Enable spool compression and attach a lifecycle policy to the bucket, and you get both resilience and cost control.

This article is based on the Trino 440 line. If you need help stabilizing large-scale batches or optimizing costs with spot instances, feel free to reach out.

— Data Dynamics Engineering Team
