airflow, orchestration, data-pipeline, airflow3, etl

Airflow 3 in Practice, Part 0: Why, and When to Use It

From the problems workflow orchestration solves, to the core changes in Airflow 3, to the roadmap for this 12-part practical series.

Data Dynamics · June 24, 2026 · 9 min read

This is Part 0 of the Airflow 3 in Practice series. Before diving into a full architecture dissection or cluster setup, let's step back and answer a basic question first: why Airflow, and when does it actually make sense to use it? We'll take it slowly so that even readers encountering workflow orchestration for the first time can follow all the way through. The internal structure is the subject of the next part, Part 1: Architecture Dissection.

The limits of getting by with cron and scripts

When you have two or three data pipelines, a couple of cron lines are enough: "Run the extract script at 3 a.m. daily, run the load script at 4 a.m." The trouble starts the moment your pipelines multiply and dependencies emerge between steps.

Once a bundle of cron jobs or shell scripts grows past a certain scale, the following three issues almost always trip you up.

  • Dependencies can't be expressed. cron doesn't know "run the transform only after the extract finishes." So people paper over it with timing: "extract usually takes 30 minutes, so start transform 30 minutes later." On a day when extract takes 35 minutes, transform reads incomplete data.
  • There's no retry or failure handling. What if a script dies midway? cron simply stays silent until the next schedule. You have to hand-code all the retry logic, backoff, and partial reruns inside the scripts themselves.
  • There's no observability. To see how far last night's pipeline got, how long each step took, or why it failed, you have to SSH into the server and dig through logs. Answering "why was last Tuesday's load empty?" is practically impossible.
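To make the second point concrete, here is a minimal sketch of the retry-with-backoff wrapper that each script ends up carrying when cron is the only scheduler. The names `run_with_retries`, `step`, `max_attempts`, and `base_delay` are hypothetical, chosen for illustration only; this is not how any particular orchestrator implements retries.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `step` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reached when the loop never ran, i.e. max_attempts < 1.
    raise ValueError("max_attempts must be at least 1")
```

Even this small helper says nothing about partial reruns, alerting, or recording what happened, which is the gap an orchestrator is meant to close.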

The three deficiencies can be compared in a single picture.

[Diagram: comparison of the three deficiencies]

cron only knows "when to run." An orchestrator knows "what depends on what, what to do on failure, and how far things have gotten right now."

So, the problems an orchestrator solves

A workflow orchestrator represents jobs as a Directed Acyclic Graph (DAG). Nodes are things to do (tasks), and edges are dependencies — "this must finish before that runs." Looking at this graph, the orchestrator automatically handles the following.

  • Scheduling tasks in dependency order (running in parallel what can run in parallel)
  • Retries, backoff, and alerts on failure
  • Centralized recording and visualization of run history, logs, and elapsed time
  • Rerunning past intervals (backfill), manual triggering, and pausing
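The first item, dependency-ordered scheduling, can be sketched with Python's standard `graphlib` module. This is a toy illustration of the idea, not Airflow's scheduler: the function `parallel_batches` and the task names in `pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; every task in a batch has all its dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unlocking its successors
    return batches


# Hypothetical pipeline: each key maps to the set of tasks it must wait for.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load": {"transform"},
}

print(parallel_batches(pipeline))
# [['extract_orders', 'extract_users'], ['transform'], ['load']]
```

The two extract tasks land in the same batch because neither depends on the other, which is exactly the "run in parallel what can run in parallel" behavior described above.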

Airflow is one of the most widely used open source tools in this space, and a defining trait is that you define DAGs as Python code. Because a DAG is code rather than YAML or a GUI, version control, testing, and code review all apply to it directly.

When Airflow is a good fit / when it isn't

No tool is a cure-all. Here is an honest line between where Airflow shines and where it doesn't.

Good fit

  • Batch-centric ETL/ELT, data warehouse loading, periodic report generation
  • Pipelines with many interdependent steps (dozens to thousands of tasks)
  • Work that ties together heterogeneous systems (DB → S3 → Spark → BI)
  • Operations where schedule + backfill + rerun history matter

Poor fit

  • Millisecond-level low-latency streaming (that's the domain of Kafka/Flink)
  • Request-response real-time API processing
  • Processing thousands of ultra-lightweight events per second (the scheduler overhead is a burden)

Airflow is a "batch orchestrator," not a "streaming engine." If you find yourself cramming real-time processing into Airflow, that is usually a sign of the wrong tool choice.

A brief comparison with the alternatives

There are several good alternatives in the orchestrator ecosystem. It's better to understand them as "differences in disposition" than as an absolute ranking.

| Tool | Definition style | Strengths | Where it tends to fit |
| --- | --- | --- | --- |
| Airflow | Python DAG | Widest ecosystem and integrations (providers), mature operational tooling | General-purpose batch ETL, complex dependencies, large-scale scheduling |
| Prefect | Python function decorators | Dynamic workflows, lightweight developer experience | Python-centric teams, dynamic/event-driven flows |
| Dagster | Asset-centric (software-defined assets) | First-class support for data assets, types, and tests | Teams looking to model data asset lineage and quality |
| Argo Workflows | Kubernetes CRD (YAML) | Container-native, K8s-friendly | Container-level pipelines on top of K8s |

Interestingly, Airflow 3 introduces the Asset-based scheduling we'll see later, absorbing a good deal of the "data-asset-centric" thinking that Dagster emphasized.

Airflow 2.x → 3.x: what changed

This entire series is based on Airflow 3.0/3.x (GA in 2025). If you've used 2.x, learning what changed first will make the rest of the series much smoother. Distilled to the essentials, here's the picture.

| Area | 2.x | 3.x | Meaning |
| --- | --- | --- | --- |
| Web component | Webserver | API server (UI + REST API unified) | A single component provides the UI and a stable, versioned REST API |
| DAG parsing | Inside the scheduler | DAG processor separated | DAG parsing isolated into an independent process (stability and security) |
| Task execution | Worker connects directly to the metadata DB | Task SDK + Task Execution API | Workers communicate through the API server → remote and language-agnostic execution |
| Data-aware scheduling | Dataset | Asset (@asset, schedule=[asset]) | Cleaned-up terminology and features, asset-centric scheduling |
| Execution tracking | Weak notion of versioning | DAG versioning | Track in the UI which version of a DAG a run executed with |
| Backfill | Mostly CLI | Scheduler-managed backfill (UI/API) | Trigger from the UI/API and the scheduler carries it out |
| catchup default | True | False | Prevents the accident of a new DAG running a flood of past intervals |
| Time field | execution_date | logical_date (or data_interval_*) | execution_date removed; logical_date can be None on manual/asset triggers |
| Removed items | SubDAG, SLA, old experimental REST API | Replaced by TaskGroup/Asset, Deadline, stable REST API | Cleanup of confusing legacy |
| Executor | Local/Celery/Kubernetes | + EdgeExecutor, plus hybrid (multiple at once) | HTTP-based remote/edge workers, multiple executors running side by side |
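If you are coming from 2.x, a few rows of the table above translate into things you can search for in existing DAG files. The following is a rough sketch under stated assumptions: `LEGACY_PATTERNS` and `find_legacy_usage` are hypothetical names, the patterns are a hand-picked subset derived from the table, and plain regex matching will produce false positives. It is not an official migration tool, and a real upgrade should follow the official upgrade guide.

```python
import re

# Hypothetical review list derived from the table above; deliberately incomplete.
LEGACY_PATTERNS = {
    r"\bexecution_date\b": "use logical_date or data_interval_start/end",
    r"\bSubDagOperator\b": "replace SubDAGs with TaskGroup",
    r"\bsla\s*=": "SLA was removed; see Deadline",
    r"\bDataset\b": "Dataset was renamed to Asset",
}


def find_legacy_usage(source: str) -> list[tuple[int, str]]:
    """Return (line_number, hint) pairs for lines that match a legacy pattern."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in LEGACY_PATTERNS.items():
            if re.search(pattern, line):
                findings.append((line_number, hint))
    return findings


# Example input: a made-up fragment of a 2.x-style DAG file.
sample = """from airflow.datasets import Dataset
run_day = context["execution_date"]
print("nothing to flag here")
"""

for line_number, hint in find_legacy_usage(sample):
    print(f"line {line_number}: {hint}")
```

A text search like this only tells you where to look; whether a match is really a 2.x leftover still needs a human read.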

Import paths changed too. In 3.x, airflow.sdk is the recommended entry point.

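from airflow.sdk import dag, task, Asset  # 3.x: authoring names come from airflow.sdk

@dag(schedule="@daily", catchup=False)  # in 3.x, the catchup default is False
def daily_etl():
    @task
    def extract():
        # The return value is handed to downstream tasks via XCom
        return {"rows": 1000}  # example value

    @task
    def load(payload: dict):
        print(f"Number of rows loaded (example): {payload['rows']}")

    # Passing extract's result into load declares the dependency extract -> load
    load(extract())

daily_etl()  # calling the decorated function registers the DAG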

Why this code looks the way it does, and what happens internally, are covered in depth in Part 4: The Right Way to Author DAGs. For now, it's enough to remember one structural change: "workers no longer attach directly to the metadata DB but go through the Task Execution API."
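As a side note on the catchup default from the table, a quick back-of-the-envelope calculation shows why True was risky for new DAGs. With catchup enabled, the scheduler creates one run for every completed interval since the DAG's start date. The helper below, `missed_intervals`, is a hypothetical name and plain date arithmetic, not an Airflow API, and the dates are made up.

```python
from datetime import datetime, timedelta, timezone


def missed_intervals(start: datetime, now: datetime, every: timedelta) -> int:
    """Count the complete schedule intervals between `start` and `now`."""
    if now <= start:
        return 0
    return (now - start) // every  # timedelta // timedelta yields an int


# Hypothetical daily DAG whose start date lies 100 days in the past.
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 4, 11, tzinfo=timezone.utc)

print(missed_intervals(start, now, timedelta(days=1)))  # 100
```

Under these assumptions, a daily DAG deployed on April 11 would immediately queue 100 runs with catchup on, and an hourly one would queue 2,400.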

What you'll learn in this series (roadmap)

The overall flow is laid out in the order "why → structure → setup → authoring → integration → operations." On a single page, it looks like this.

[Diagram: series roadmap, why → structure → setup → authoring → integration → operations]

You can jump directly to each part from the table of contents below.

| Part | Title | One-line gist |
| --- | --- | --- |
| 0 | (this article) Series overview & when to use it | Defining the orchestration problem and an overview of the 3.x changes |
| 1 | Architecture Dissection | How the Scheduler, API server, DAG processor, Triggerer, and Worker mesh together |
| 2 | Setting Up as a Cluster | Going beyond a single node to deploy components separately |
| 3 | Configuration & Optimization | Tuning the three concurrency layers, Pools, and resource isolation |
| 4 | The Right Way to Author DAGs | Task SDK, decorators, TaskGroup, best practices |
| 5 | Advanced DAG Techniques | Scripts, parameters, error handling, PostgreSQL, reruns, date references |
| 6 | Scheduling & Asset | cron schedules and asset-driven scheduling |
| 7 | XCom & Passing Data | Passing data between tasks and its limits |
| 8 | Integrating External Systems & Synchronous Calls | Connections, Hooks, providers, and the deferrable pattern |
| 9 | REST API & Remote Schedule Changes | Remote triggering and control via the JWT-authenticated REST API |
| 10 | Monitoring & Operations | Metrics, alerts, logs, and the SLA replacement (Deadline) |
| 11 | Testing, CI/CD, Security | DAG testing, pipelines, and managing permissions and secrets |
| 12 | Production Checklist | Everything to check before going into production |

Wrapping up

The essence of orchestration is filling cron's three deficiencies — "dependencies, retries, observability." Airflow is a mature tool for expressing all of that as Python code, and with 3.x its structure has become a notch cleaner thanks to the API server/DAG processor separation, the Task Execution API, and Asset scheduling. That said, for work of a different nature, such as real-time streaming, it's right to use a different tool.

The starting point for choosing a tool is the question, "Is my workload a batch dependency graph?" If so, Airflow is almost always a candidate.

In the next part, we'll take apart how the components we just skimmed in the table actually mesh together and run.

➡️ Next part: Part 1 — Airflow 3 Architecture Dissection