Tags: airflow, xcom, taskflow, data-pipeline

Airflow 3 XCom — The Right Way to Pass Data Between Tasks

A practical guide to how XCom works, TaskFlow's automatic XCom, and using a Custom XCom Backend to route around large payloads.

Data Dynamics · July 1, 2026 · 10 min read

This is Part 7 of the Airflow 3 in Practice series. The previous part, Scheduling & Assets, covered when a DAG should run; this time we talk about how tasks pass data to one another inside a DAG. It leads naturally into the next part, Integrating External Systems & Calling Sinks.

Tasks run in separate processes (sometimes on separate machines). So a value computed by task A can't simply be referenced like a variable by task B. What would be a plain return inside a single Python function becomes a "how do I hand this over?" problem the moment it crosses a task boundary. The thing that bridges that gap is XCom (cross-communication).

What you'll learn in this article

How XCom is stored as key-value records in the metadata DB, and the limits that come from that

The flow of xcom_push/xcom_pull and how TaskFlow API return values automatically become XCom

How to route around large payloads with a Custom XCom Backend (S3/GCS, etc.)

How XCom is now processed through the Task Execution API in Airflow 3

What you should not pass through XCom, and how it differs from an Asset

1. XCom Is Really Just a Small Note in the Metadata DB

XCom itself is surprisingly simple: it's a key-value record stored in a single table (xcom) of the metadata DB. A single XCom entry is identified roughly by these coordinates.

dag_id — which DAG
run_id — which run
task_id — which task left it
key — the name of the note (defaults to return_value)
value — the serialized value

In other words, XCom writes "a note left by this run, by this task" into a database, and another task reads it back using the same coordinates: task A sticks the note up (push), and task B peels it off to read it (pull).
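The push/pull flow can be imitated with a few lines of plain Python. This is a toy model, assuming an in-memory SQLite table in place of the metadata DB and JSON as the serializer; the push and pull helpers are written for this sketch and are not Airflow functions, and the real xcom table has more columns than the five listed above.

```python
import json
import sqlite3

# Toy stand-in for the metadata DB (assumption: the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE xcom (
        dag_id TEXT, run_id TEXT, task_id TEXT, key TEXT, value TEXT,
        PRIMARY KEY (dag_id, run_id, task_id, key)
    )
    """
)


def push(dag_id, run_id, task_id, value, key="return_value"):
    """Task A sticks a note up: serialize the value and store it under its coordinates."""
    conn.execute(
        "INSERT OR REPLACE INTO xcom VALUES (?, ?, ?, ?, ?)",
        (dag_id, run_id, task_id, key, json.dumps(value)),
    )


def pull(dag_id, run_id, task_id, key="return_value"):
    """Task B reads the note back with the same coordinates (None if nothing is there)."""
    row = conn.execute(
        "SELECT value FROM xcom WHERE dag_id=? AND run_id=? AND task_id=? AND key=?",
        (dag_id, run_id, task_id, key),
    ).fetchone()
    return json.loads(row[0]) if row else None


push("etl", "run_2026_06_30", "extract", "/data/raw/2026-06-30.parquet")
push("etl", "run_2026_06_30", "extract", 1240, key="row_count")

print(pull("etl", "run_2026_06_30", "extract"))                   # default key
print(pull("etl", "run_2026_06_30", "extract", key="row_count"))  # explicit key
print(pull("etl", "run_2026_07_01", "extract"))                   # a different run
```

The last call looks in a different run_id and finds nothing, which is the run-scoping behavior in miniature.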

Two important properties stand out here.

First, XCom naturally flows only within the scope of the same DAG run. To pull in a note from a different run you have to specify the coordinates explicitly, and that's not a common usage pattern.

Second, the fact that the value goes straight into the DB is the starting point of every limitation. The metadata DB is the busiest resource in the cluster — the scheduler, DAG processor, and web UI hammer it constantly (for the role of the metadata DB, see the Architecture part). Writing large values to that table frequently makes the DB load and size balloon quickly, and eventually the whole cluster slows down.

XCom is a "sticky note," not a "shipping box." It was designed for passing small values (file paths, IDs, counts, configuration flags).
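One way to keep yourself honest about "sticky note" size is to measure the serialized value before returning it. The sketch below is plain Python; the 48 KB budget is an arbitrary number chosen for illustration, JSON is assumed as the serializer, and neither helper is part of Airflow.

```python
import json

# Hypothetical budget for this sketch; Airflow itself does not enforce this number.
MAX_XCOM_BYTES = 48 * 1024


def xcom_size_bytes(value) -> int:
    """Size of the value once JSON-serialized and UTF-8 encoded."""
    return len(json.dumps(value).encode("utf-8"))


def fits_in_xcom(value, limit: int = MAX_XCOM_BYTES) -> bool:
    """True if the serialized value stays within the chosen budget."""
    return xcom_size_bytes(value) <= limit


pointer = "s3://warehouse/staging/2026-06-30/data.parquet"
records = [{"id": i, "name": f"user-{i}"} for i in range(10_000)]

print(xcom_size_bytes(pointer), fits_in_xcom(pointer))   # a path: tiny
print(xcom_size_bytes(records), fits_in_xcom(records))   # a record list: far too big
```

A check like this could sit at the end of a task during development to flag return values that have quietly grown past note size.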

2. push and pull — and TaskFlow's Automatic XCom

The traditional approach is to push/pull directly through ti (the task instance) from the context. In Airflow 3, the import path has been consolidated under airflow.sdk.

from airflow.sdk import dag, task
import pendulum
 
 
@dag(schedule=None, start_date=pendulum.datetime(2026, 1, 1), catchup=False)
def manual_xcom():
    @task
    def extract(**context):
        # push by explicitly specifying a key
        context["ti"].xcom_push(key="row_count", value=1240)
        return "/data/raw/2026-06-30.parquet"  # also auto-pushed as return_value
 
    @task
    def report(**context):
        ti = context["ti"]
        count = ti.xcom_pull(task_ids="extract", key="row_count")
        path = ti.xcom_pull(task_ids="extract")  # reads return_value when key is omitted
        print(f"{count} rows at {path}")
 
    extract() >> report()
 
 
manual_xcom()

You leave a value under an arbitrary key with xcom_push, and retrieve it by specifying task_ids and key with xcom_pull. If you omit key, it reads the default key, return_value.

In practice, though, most data passing isn't done by hand-pushing/pulling like this. With the TaskFlow API, a function's return value is automatically pushed as XCom, and passing that result as an argument to another task function automatically pulls it. The code reads like an ordinary Python function call, which is intuitive.

from airflow.sdk import dag, task
import pendulum
 
 
@dag(schedule=None, start_date=pendulum.datetime(2026, 1, 1), catchup=False)
def taskflow_xcom():
    @task
    def extract() -> str:
        # the return value is automatically pushed to XCom (return_value)
        return "/data/raw/2026-06-30.parquet"
 
    @task
    def transform(path: str) -> dict:
        # the value received as an argument is the result of an automatic XCom pull
        return {"path": path, "rows": 1240}
 
    @task
    def load(summary: dict) -> None:
        print(f"loading {summary['rows']} rows from {summary['path']}")
 
    raw = extract()
    summary = transform(raw)
    load(summary)
 
 
taskflow_xcom()

The value returned by extract() is automatically stored in XCom, and the moment you pass it to transform(raw), Airflow builds the dependency and fills in the value by pulling it at execution time. Neither explicit >> wiring nor xcom_pull calls are needed. The mechanism is only hidden, though: the value still goes through the metadata DB, so automatic XCom doesn't make the limits disappear.
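To make the hidden mechanism concrete, here is a toy model of the idea in plain Python. It is not how Airflow implements TaskFlow; the Ref class, the run function, and the dictionary standing in for the xcom table are all invented for this sketch. It only shows the two-phase shape: calling a task records a reference, and the value is pushed and pulled later, at execution time.

```python
import functools

xcom_store = {}   # (task_id, key) -> value; stands in for the xcom table
run_order = []    # task_ids in the order they actually executed


class Ref:
    """Placeholder returned at DAG-definition time instead of a real value."""

    def __init__(self, task_id, fn, args):
        self.task_id, self.fn, self.args = task_id, fn, args


def task(fn):
    @functools.wraps(fn)
    def declare(*args):
        # Calling a task only records it; nothing runs yet.
        return Ref(fn.__name__, fn, args)
    return declare


def run(ref):
    """Execute a task after its upstreams and push its result as return_value."""
    key = (ref.task_id, "return_value")
    if key not in xcom_store:
        # A Ref argument is both a dependency and a value to pull.
        resolved = [run(a) if isinstance(a, Ref) else a for a in ref.args]
        xcom_store[key] = ref.fn(*resolved)   # the automatic "push"
        run_order.append(ref.task_id)
    return xcom_store[key]                    # the automatic "pull"


@task
def extract():
    return "/data/raw/2026-06-30.parquet"


@task
def transform(path):
    return {"path": path, "rows": 1240}


summary = transform(extract())   # only builds the graph
print(run(summary))
print(run_order)
```

Every intermediate result ends up in xcom_store, which is the point: the function-call syntax hides the store but does not remove it.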

3. What Changed in Airflow 3 — Workers No Longer Touch the Metadata DB Directly

Back in the Airflow 2.x days, workers fired SQL directly at the metadata DB to read and write XCom. When the worker count grew into the hundreds, exploding DB connections and exposed credentials were a headache.

In Airflow 3, this structure has changed. Workers (tasks) no longer connect to the metadata DB directly; they push/pull XCom through the API server's Task Execution API. That is, xcom_push/xcom_pull internally become API calls, and the API server handles the actual DB writes on their behalf.


Thanks to this, even remote/edge workers and non-Python tasks can handle XCom the same way (for background on the Task SDK / EdgeExecutor, see the Architecture part). From the author's perspective the code stays the same; it's enough to know that the data path has been abstracted by one layer.
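As a rough mental model of that extra layer, the sketch below separates a component that owns the store from clients that can only send it requests. Everything here is hypothetical: the class names, the JSON request shape, and the "set"/"get" operations are made up for illustration, and a direct function call stands in for HTTP. The real Task Execution API has its own schema.

```python
import json


class ApiServer:
    """Only this component holds the (toy) metadata DB."""

    def __init__(self):
        self._xcom_table = {}

    def handle(self, request: str) -> str:
        req = json.loads(request)
        coords = (req["dag_id"], req["run_id"], req["task_id"], req["key"])
        if req["op"] == "set":
            self._xcom_table[coords] = req["value"]
            return json.dumps({"ok": True})
        return json.dumps({"value": self._xcom_table.get(coords)})


class WorkerClient:
    """What a task sees: push/pull calls that become requests, never SQL."""

    def __init__(self, send):
        self._send = send  # transport; a plain function call in this sketch

    def xcom_push(self, key, value, **coords):
        self._send(json.dumps({"op": "set", "key": key, "value": value, **coords}))

    def xcom_pull(self, key="return_value", **coords):
        reply = self._send(json.dumps({"op": "get", "key": key, **coords}))
        return json.loads(reply)["value"]


server = ApiServer()
worker_a = WorkerClient(server.handle)
worker_b = WorkerClient(server.handle)

coords = {"dag_id": "etl", "run_id": "run_1", "task_id": "extract"}
worker_a.xcom_push("row_count", 1240, **coords)
print(worker_b.xcom_pull("row_count", **coords))
```

Neither worker holds a database handle or credentials; they only need to reach the server, which is what lets remote workers participate.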

4. What You Should Not Pass Through XCom

Most XCom incidents come down to the same mistake: "I put something too big in here." Below is what you should not pass, and what to pass instead.

Don't pass	Why	Pass instead
Large DataFrames, entire datasets	Bloats the metadata DB, serialization cost explodes	The storage location (e.g., `s3://.../part-0001.parquet`)
Image/file bytes	The DB is not a binary store	An object storage path
Lists of thousands of records	A single XCom entry grows to several MB	A table/partition identifier
Passwords/tokens	Exposed in the DB, UI, and logs	Connection/Variable, a secrets manager

There is one core principle.

Pass a "pointer to the data," not the data itself. Task A writes its large result to storage and passes only the path through XCom, and task B receives that path and reads it directly.

import pandas as pd
from airflow.sdk import task


@task
def extract() -> str:
    df = fetch_large_dataset()          # placeholder: assumed to return a large pandas DataFrame
    path = "s3://warehouse/staging/2026-06-30/data.parquet"
    df.to_parquet(path)                 # write to storage (s3:// paths need s3fs installed), and
    return path                          # through XCom, only the path (a pointer)


@task
def transform(path: str) -> str:
    df = pd.read_parquet(path)          # receive the path and read it directly
    out = "s3://warehouse/curated/2026-06-30/data.parquet"
    process(df).to_parquet(out)         # placeholder: process() is assumed to return a DataFrame
    return out
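The same pattern can be tried locally without S3 or pandas. In this minimal sketch a temporary directory stands in for object storage and the functions are plain Python rather than @task-decorated; the file names and the CSV layout are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# A local temp directory stands in for object storage in this sketch.
STORAGE = Path(tempfile.mkdtemp(prefix="xcom_demo_"))


def extract() -> str:
    """Write the (pretend) large result to storage and return only its path."""
    path = STORAGE / "staging.csv"
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount"])
        writer.writerows((i, i * 10) for i in range(1000))
    return str(path)  # this short string is all that would go through XCom


def transform(path: str) -> str:
    """Receive the pointer, read the data directly, write a new file, return its path."""
    with open(path, newline="") as f:
        total = sum(int(row["amount"]) for row in csv.DictReader(f))
    out = STORAGE / "curated.csv"
    with out.open("w", newline="") as f:
        csv.writer(f).writerows([["total_amount"], [total]])
    return str(out)


staged = extract()
curated = transform(staged)
print(len(staged.encode()), "bytes passed;", Path(staged).stat().st_size, "bytes stored")
print(Path(curated).read_text())
```

The first print compares the size of what crosses the task boundary with the size of what sits in storage; the gap only widens as the data grows.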

How XCom Differs from an Asset

At this point you might think of an Asset. They're similar in that they "point to a data location," but their purposes differ.

XCom: a means of communication for tasks to pass small values to each other within the same DAG run. It plays no part in scheduling.
Asset: a scheduling signal that expresses a data dependency between DAGs, so that when one DAG updates its data, another DAG that consumes it gets triggered.

To sum up: coordinating the moving parts inside a single pipeline is XCom; the baton pass between pipelines is the Asset.

5. Custom XCom Backend — Routing Around Large Payloads

Even when you follow the "pass a pointer" principle, there are still cases where you want to hand off an intermediate result whole. And the boilerplate of writing to storage by hand and returning the path every time can get tedious. This is where the Custom XCom Backend comes in.

The idea is this: store the XCom value in object storage (S3/GCS, etc.) rather than the metadata DB, and leave only a small reference pointing to its location in the metadata DB. The author just does return df as usual, and the backend handles serialization, upload, and download on its own.


The easiest approach is to use the backend provided by the official Common IO provider. This backend sends only values above a configured threshold to storage and leaves small values in the metadata DB as-is.

# airflow.cfg (or environment variables)
[core]
# specify the backend class responsible for XCom serialization/storage
xcom_backend = airflow.providers.common.io.xcom.backend.XComObjectStorageBackend
 
[common.io]
# the object storage path to store values in (configure credentials via a Connection)
xcom_objectstorage_path = s3://my-airflow-xcom/xcom
# only values exceeding this byte count go to storage; smaller ones stay in the metadata DB
xcom_objectstorage_threshold = 1048576

When provided as environment variables, they take forms like AIRFLOW__CORE__XCOM_BACKEND and AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH (for the pattern of injecting configuration via environment variables, see the Configuration & Optimization part). If you want to implement a backend yourself, inherit from BaseXCom and override serialize_value/deserialize_value.

# In Airflow 3, BaseXCom lives in the Task SDK (airflow.models.xcom was the Airflow 2.x location)
from airflow.sdk.bases.xcom import BaseXCom


class MyS3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value, **kwargs):
        # if the value is large, upload it to S3 and return only the serialized S3 path (a reference) for the metadata DB
        ...

    @staticmethod
    def deserialize_value(result):
        # look at the reference in the metadata DB, download the actual value from S3, and restore it
        ...

That said, a Custom Backend is not a silver bullet either. Set the threshold too low and even small values make a round trip to storage every time, which actually slows things down, and you also have to manage the lifecycle (expiration/cleanup) of the XCom objects piling up in storage yourself. A hybrid setup that sends to storage only above the threshold is usually the sensible choice.
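The hybrid behavior can be sketched without Airflow or S3. The code below is an illustration of the threshold idea only, not the Common IO provider's implementation: a dictionary stands in for object storage, the objstore:// prefix is a made-up marker, and the 1 KB threshold is chosen so the example is easy to trigger (the config example above uses 1048576).

```python
import json
import uuid

THRESHOLD_BYTES = 1024          # hypothetical value for this sketch
object_store = {}               # stands in for S3/GCS: reference -> serialized bytes
REF_PREFIX = "objstore://"      # made-up marker meaning "this value lives in storage"


def serialize_value(value) -> str:
    """Return what would be written to the metadata DB: the value or a reference."""
    payload = json.dumps(value)
    if len(payload.encode("utf-8")) <= THRESHOLD_BYTES:
        return payload                                # small: keep inline
    ref = f"{REF_PREFIX}{uuid.uuid4()}"
    object_store[ref] = payload.encode("utf-8")       # large: "upload"
    return json.dumps(ref)                            # only the reference goes to the DB


def deserialize_value(stored: str):
    """Inverse: follow the reference if there is one, otherwise decode inline."""
    value = json.loads(stored)
    if isinstance(value, str) and value.startswith(REF_PREFIX):
        return json.loads(object_store[value].decode("utf-8"))
    return value


small = {"rows": 1240}
large = list(range(5000))
stored_small, stored_large = serialize_value(small), serialize_value(large)

print(len(stored_small), len(stored_large), len(object_store))
print(deserialize_value(stored_small) == small, deserialize_value(stored_large) == large)
```

Two caveats show up even in the toy: a legitimate string that happens to start with the marker would be misread as a reference, and nothing ever deletes entries from object_store, which is the cleanup burden mentioned above.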

Wrapping Up

Once you grasp the essence of XCom as "a small note stuck between tasks," there's nothing confusing about it.

Because XCom values are stored in the metadata DB, pass only small values (pointers).
With the TaskFlow API, return values automatically become XCom and the code stays clean — but the limits remain.
Keep large data in storage and pass only the path, or route around it at the backend level with a Custom XCom Backend.
In Airflow 3, workers no longer touch the metadata DB directly; they go through the Task Execution API.
Distinguish communication within the same run (XCom) from data dependencies between DAGs (Asset).

In the next part, Integrating External Systems & Calling Sinks, we cover how to safely exchange this refined data with the outside world — databases, APIs, message queues, and the like.

To dig deeper, see the official docs on XComs and the Object Storage XCom backend.