
Airflow 3 in Practice, Part 0: Why, and When to Use It

From the problems workflow orchestration solves, to the core changes in Airflow 3, to the roadmap for this 12-part practical series.

Data Dynamics · June 24, 2026 · 9 min read

This is Part 0 of the Airflow 3 in Practice series. Before diving into a full architecture dissection or cluster setup, let's step back and answer a more basic question: why Airflow, and when does it actually make sense to use it? We'll take it slowly so that even readers encountering workflow orchestration for the first time can follow all the way through. The next part, Part 1: Architecture Dissection, digs into the internal structure.

The limits of getting by with cron and scripts

When you have two or three data pipelines, a single line of cron is enough: "Run the extract script at 3 a.m. daily, run the load script at 4 a.m." The trouble starts the moment your pipelines multiply and dependencies emerge between steps.

Once a bundle of cron jobs or shell scripts grows past a certain scale, the following three issues almost always trip you up.

Dependencies can't be expressed. cron doesn't know "run the transform only after the extract finishes." So people paper over it with timing: "extract usually takes 30 minutes, so start transform 30 minutes later." On a day when extract takes 35 minutes, transform reads incomplete data.
There's no retry or failure handling. What if a script dies midway? cron simply stays silent until the next schedule. You have to hand-code all the retry logic, backoff, and partial reruns inside the scripts themselves.
There's no observability. To see how far last night's pipeline got, how long each step took, or why it failed, you have to SSH into the server and dig through logs. Answering "why was last Tuesday's load empty?" is practically impossible.
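To make the second point concrete, here is a minimal sketch of the retry-with-backoff wrapper a cron-driven script ends up carrying around. The name `run_with_retries` and its parameters are hypothetical and chosen for illustration. The `sleep` argument exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call step(); on failure wait base_delay * 2**(attempt - 1) seconds, then retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Every script needs its own copy of this, and it still says nothing about alerting or rerunning only the failed step. An orchestrator provides that part as configuration.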

Comparing these three deficiencies in a single picture looks like this.

[Diagram: cron vs. an orchestrator, compared on dependencies, retries, and observability]

cron only knows "when to run." An orchestrator knows "what depends on what, what to do on failure, and how far things have gotten right now."

So, the problems an orchestrator solves

A workflow orchestrator represents jobs as a Directed Acyclic Graph (DAG). Nodes are things to do (tasks), and edges are dependencies — "this must finish before that runs." Looking at this graph, the orchestrator automatically handles the following.

Scheduling tasks in dependency order (running in parallel what can run in parallel)
Retries, backoff, and alerts on failure
Centralized recording and visualization of run history, logs, and elapsed time
Rerunning past intervals (backfill), manual triggering, and pausing
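The first item, running tasks in dependency order while parallelizing what can be parallelized, can be sketched with Python's standard-library `graphlib`. This is a toy illustration of the idea, not how Airflow's scheduler is implemented. The task names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_waves(deps):
    """Group tasks into waves; every task in a wave has all its dependencies done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run in parallel now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unblocking downstream tasks
    return waves


for number, wave in enumerate(execution_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The two extract tasks land in the same wave because neither depends on the other, while `transform` waits for both. That is exactly the guarantee the "start 30 minutes later" cron trick cannot give.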

Airflow is the most widely used open-source tool in this space, and a defining trait is that you define DAGs as Python code. Because a pipeline is code rather than YAML or a GUI, version control, testing, and code review apply to it the same way they apply to any other code.

When Airflow is a good fit / when it isn't

No tool is a cure-all. Here is an honest line between where Airflow shines and where it doesn't.

Good fit

Batch-centric ETL/ELT, data warehouse loading, periodic report generation
Pipelines with many interdependent steps (dozens to thousands of tasks)
Work that ties together heterogeneous systems (DB → S3 → Spark → BI)
Operations where schedule + backfill + rerun history matter

Poor fit

Millisecond-level low-latency streaming (that's the domain of Kafka/Flink)
Request-response real-time API processing
Processing thousands of ultra-lightweight events per second (the scheduler overhead is a burden)

Airflow is a "batch orchestrator," not a "streaming engine." Trying to cram real-time processing into Airflow is usually a sign of the wrong tool choice.

A brief comparison with the alternatives

There are several good alternatives in the orchestrator ecosystem. It's better to understand them as "differences in disposition" than as an absolute ranking.

Tool	Definition style	Strengths	Where it tends to fit
Airflow	Python DAG	Widest ecosystem and integrations (providers), mature operational tooling	General-purpose batch ETL, complex dependencies, large-scale scheduling
Prefect	Python function decorators	Dynamic workflows, lightweight developer experience	Python-centric teams, dynamic/event-driven flows
Dagster	Asset-centric (software-defined assets)	First-class support for data assets, types, and tests	Teams looking to model data asset lineage and quality
Argo Workflows	Kubernetes CRD (YAML)	Container-native, K8s-friendly	Container-level pipelines on top of K8s

Interestingly, Airflow 3 introduces the Asset-based scheduling we'll see later, absorbing a good deal of the "data-asset-centric" thinking that Dagster emphasized.

Airflow 2.x → 3.x: what changed

This entire series is based on Airflow 3.0/3.x (GA in 2025). If you've used 2.x, learning what changed first will make the rest of the series much smoother. Distilled to the essentials, the changes are as follows.

Area	2.x	3.x	Meaning
Web component	Webserver	API server (UI + REST API unified)	A single component provides the UI and a stable, versioned REST API
DAG parsing	Inside the scheduler	DAG processor separated	DAG parsing isolated into an independent process (stability and security)
Task execution	Worker connects directly to the metadata DB	Task SDK + Task Execution API	Workers communicate through the API server → remote and language-agnostic execution
Data-aware scheduling	Dataset	Asset (`@asset`, `schedule=[asset]`)	Cleaned-up terminology and features, asset-centric scheduling
Execution tracking	Weak notion of versioning	DAG versioning	Track in the UI which version of a DAG a run executed with
Backfill	Mostly CLI	Scheduler-managed backfill (UI/API)	Trigger from the UI/API and the scheduler carries it out
`catchup` default	`True`	`False`	Prevents the accident of a new DAG running a flood of past intervals
Time field	`execution_date`	`logical_date` (or `data_interval_*`)	`execution_date` removed; can be None on manual/asset triggers
Removed items	SubDAG, SLA, old experimental REST API	Replaced by TaskGroup/Asset, Deadline, stable REST API	Cleanup of confusing legacy
Executor	Local/Celery/Kubernetes	+ EdgeExecutor, plus multiple executors side by side	HTTP-based remote/edge workers; multi-executor configuration (available since 2.10) replaces the removed hybrid executor classes
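The `catchup` row is easier to appreciate with numbers. The sketch below is a deliberately simplified model of a daily schedule, not Airflow's actual scheduling code. The function `daily_intervals_to_run` and the dates are hypothetical. It assumes that with catch-up on, every completed interval since the start date gets a run, and with catch-up off, only the most recent one does.

```python
from datetime import datetime, timedelta, timezone


def daily_intervals_to_run(start_date, now, catchup):
    """Toy model: return the (start, end) daily intervals that would get a run."""
    interval = timedelta(days=1)
    completed = []
    cursor = start_date
    # An interval is runnable only once it has fully elapsed.
    while cursor + interval <= now:
        completed.append((cursor, cursor + interval))
        cursor += interval
    # catchup=True: run everything missed; catchup=False: only the latest interval.
    return completed if catchup else completed[-1:]


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 4, 1, tzinfo=timezone.utc)

print(len(daily_intervals_to_run(start, now, catchup=True)))   # 90
print(len(daily_intervals_to_run(start, now, catchup=False)))  # 1
```

A DAG deployed on April 1 with a start date of January 1 would queue 90 runs under the old default. Under the 3.x default it queues one.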

Import paths changed too. In 3.x, airflow.sdk is the recommended entry point.

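from airflow.sdk import dag, task

# Declare a DAG that runs once per day.
@dag(schedule="@daily", catchup=False)  # in 3.x, the catchup default is False
def daily_etl():
    @task
    def extract():
        return {"rows": 1000}  # example value

    @task
    def load(payload: dict):
        print(f"Number of rows loaded (example): {payload['rows']}")

    # Passing extract()'s result into load() declares the dependency extract → load.
    load(extract())

daily_etl()  # calling the decorated function registers the DAG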

Why this code looks the way it does and what happens internally is covered in depth in Part 4: The Right Way to Author DAGs. For now, it's enough to remember just the one structural change: "workers no longer attach directly to the metadata DB but go through the Task Execution API."

What you'll learn in this series (roadmap)

The overall flow is laid out in the order "why → structure → setup → authoring → integration → operations." On a single page, it looks like this.

[Diagram: series roadmap, why → structure → setup → authoring → integration → operations]

You can jump directly to each part from the table of contents below.

Part	Title	One-line gist
0	(this article) Series overview & when to use it	Defining the orchestration problem and an overview of the 3.x changes
1	Architecture Dissection	How the Scheduler, API server, DAG processor, Triggerer, and Worker mesh together
2	Setting Up as a Cluster	Going beyond a single node to deploy components separately
3	Configuration & Optimization	Tuning the three concurrency layers, Pools, and resource isolation
4	The Right Way to Author DAGs	Task SDK, decorators, TaskGroup, best practices
5	Advanced DAG Techniques	Scripts, parameters, error handling, PostgreSQL, reruns, date references
6	Scheduling & Asset	cron schedules and asset-driven scheduling
7	XCom & Passing Data	Passing data between tasks and its limits
8	Integrating External Systems & Synchronous Calls	Connections, Hooks, providers, and the deferrable pattern
9	REST API & Remote Schedule Changes	Remote triggering and control via the JWT-authenticated REST API
10	Monitoring & Operations	Metrics, alerts, logs, and the SLA replacement (Deadline)
11	Testing, CI/CD, Security	DAG testing, pipelines, and managing permissions and secrets
12	Production Checklist	Everything to check before going into production

Wrapping up

The essence of orchestration is filling cron's three deficiencies — "dependencies, retries, observability." Airflow is a mature tool for expressing all of that as Python code, and with 3.x its structure has become a notch cleaner thanks to the API server/DAG processor separation, the Task Execution API, and Asset scheduling. That said, for work of a different nature, such as real-time streaming, it's right to use a different tool.

The starting point for choosing a tool is the question, "Is my workload a batch dependency graph?" If so, Airflow is almost always a candidate.

In the next part, we'll take apart how the components we just skimmed in the table actually mesh together and run.

➡️ Next part: Part 1 — Airflow 3 Architecture Dissection