Airflow 3 Production Best Practices Checklist
Review idempotency, resource optimization, and reliability, then verify operational readiness with a pre-deployment checklist. The final part of the Airflow 3 in Practice series.
Over the past 12 parts, we have covered everything from Airflow 3's architecture to testing, CI/CD, and security. This final part does not introduce a new feature; it takes one more look before you put those features into real production. Most "it ran fine locally but blew up in production" incidents trace back to missing one of three things: idempotency, resources, or reliability.
This is Part 12 (the final part) of the "Airflow 3 in Practice" series. The previous part, Testing, CI/CD, and Security, covered how to deploy pipelines safely; this part checks whether those pipelines hold up in production. There is no next part.
What this article checks
- Designing idempotent tasks that are safe to rerun
- How to save resources and cost with concurrency, Pools, and deferrable operators
- How to secure reliability with retries, timeouts, and backfill
- An operational readiness checklist to run through in one pass before deployment
- A one-page map summarizing the entire series
1. Idempotency — Tasks That Are Safe to Rerun
The first commandment of a production pipeline is idempotency. It means that no matter how many times you rerun the same task, the result must be identical. In Airflow, running the same task multiple times is routine — through retries, backfills, manual clear & rerun, and so on — so a non-idempotent task will inevitably cause incidents in production.
The key is to think in terms of "overwrite" rather than "append". Picture each run completely rewriting the single partition it is responsible for.
from airflow.sdk import dag, task
import pendulum

# delete_partition, insert_partition, and extract are placeholders for your own helpers.
@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
    catchup=False,  # In Airflow 3 the default changed to False
)
def daily_etl():
    @task
    def write_partition(data_interval_start=None):
        # Derive partition boundaries from data_interval instead of logical_date
        ds = data_interval_start.format("YYYY-MM-DD")
        # 1) First clear that partition (the heart of idempotency)
        delete_partition(table="events", dt=ds)
        # 2) Then refill only that partition: same result no matter how many reruns
        insert_partition(table="events", dt=ds, rows=extract(ds))

    write_partition()

daily_etl()

Idempotency in one line: "Write with DELETE then INSERT, or with MERGE/UPSERT, not a bare INSERT."
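To make the "overwrite, not append" idea concrete without an Airflow installation, here is a minimal sketch using the standard-library sqlite3 module. The events table, its columns, and the write_partition helper are assumptions made for illustration; a real warehouse would use its own DELETE/INSERT, MERGE, or INSERT OVERWRITE equivalent.

```python
import sqlite3


def write_partition(conn: sqlite3.Connection, dt: str, rows: list) -> None:
    """Overwrite one date partition of the (hypothetical) events table."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM events WHERE dt = ?", (dt,))
        conn.executemany(
            "INSERT INTO events (dt, user_id, amount) VALUES (?, ?, ?)",
            [(dt, user_id, amount) for user_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dt TEXT, user_id TEXT, amount INTEGER)")

rows = [("alice", 10), ("bob", 20)]
for _ in range(3):  # simulate a first run plus two reruns
    write_partition(conn, "2026-01-01", rows)

count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE dt = ?", ("2026-01-01",)
).fetchone()[0]
print(count)  # 2
```

The partition holds two rows after three runs. With a bare INSERT it would hold six.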
For reference, in Airflow 3 execution_date has been removed and you use logical_date, but for asset-driven or manually triggered runs logical_date can be None. So when you derive a partition key, it is safer to use data_interval_start/data_interval_end whenever possible.
The flowchart below shows which gates a safe-to-rerun pipeline passes through.
One more thing here: it is important to leave no partial writes behind. If a task dies halfway through writing a partition, a rerun can produce duplicates, and anything reading the table in the meantime sees half-written data. Whenever possible, write to a temporary location and do an atomic swap at the end, or use operations that guarantee atomicity such as transactions, INSERT OVERWRITE, or MERGE.
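For file-based partitions, the "temporary location, then atomic swap" pattern might look like the sketch below. The write_partition_atomically helper, the JSON format, and the dt=... file layout are illustrative assumptions. The swap relies on os.replace, which is atomic on POSIX systems when source and target are on the same filesystem; object stores need their own mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition_atomically(target: Path, rows: list) -> None:
    """Write rows to a temp file next to target, then swap it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(rows, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_name, target)  # readers see the old file or the new one
    except BaseException:
        # A failed run leaves the previous partition file untouched.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


base = Path(tempfile.mkdtemp())
partition = base / "events" / "dt=2026-01-01.json"
write_partition_atomically(partition, [{"user": "alice", "amount": 10}])
write_partition_atomically(partition, [{"user": "alice", "amount": 10}])  # rerun
result = json.loads(partition.read_text(encoding="utf-8"))
```

A rerun replaces the whole file, so the partition never contains a mix of old and new rows.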
2. Resource and Cost Optimization
As your pipelines grow, the next concern shifts to "we don't have enough workers" or "the cloud bill is too high". There are three main levers for saving resources.
Three layers of concurrency. As covered in detail in Part 3 (Configuration & Optimization), Airflow controls concurrency at three levels.
| Setting | Scope | Role |
|---|---|---|
| parallelism | Entire system | Total number of tasks that can run concurrently |
| max_active_tasks_per_dag | Per DAG | Number of tasks running concurrently within one DAG |
| max_active_runs_per_dag | Per DAG | Number of concurrent runs of one DAG |
Pools. To protect a specific resource (for example, an external DB connection or a licensed API), use a Pool to cap concurrent access. Combined with priority_weight, the important tasks grab a slot first.
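The toy model below illustrates how a limited number of pool slots and priority_weight interact. It is not Airflow's scheduler code: QueuedTask and pick_tasks_for_pool are hypothetical names, and the model ignores weight rules and everything else the real scheduler considers. In a DAG, you would set the pool, priority_weight, and pool_slots arguments on the task itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTask:
    task_id: str
    priority_weight: int = 1
    pool_slots: int = 1  # how many slots this task takes from the pool


def pick_tasks_for_pool(queued: list, open_slots: int) -> list:
    """Return the task ids that fit into the pool, highest priority first."""
    chosen = []
    for t in sorted(queued, key=lambda t: t.priority_weight, reverse=True):
        if t.pool_slots <= open_slots:
            chosen.append(t.task_id)
            open_slots -= t.pool_slots
    return chosen


queued = [
    QueuedTask("nightly_report", priority_weight=1),
    QueuedTask("billing_export", priority_weight=10),
    QueuedTask("adhoc_refresh", priority_weight=5),
]
started = pick_tasks_for_pool(queued, open_slots=2)
print(started)  # ['billing_export', 'adhoc_refresh']
```

With two open slots, the lowest-priority task stays queued until a slot frees up, which is the behavior you want for a scarce resource such as a database connection.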
Deferrable operators. As we saw in Part 8 (External System Integration), if a task does nothing but wait for an external job to finish, make it deferrable and hand it off to the Triggerer. Since it polls asynchronously without occupying a worker slot, it eliminates the waste of a "task that only waits" locking up a worker.
# SomeSensor stands in for any sensor that supports deferrable=True
# Bad: occupies a worker slot while sleeping
sensor = SomeSensor(task_id="wait", mode="poke")
# Good: deferrable, so the Triggerer waits asynchronously and the worker is free
sensor = SomeSensor(task_id="wait", deferrable=True)

Also consider worker autoscaling. KubernetesExecutor spins up a pod per task and scales naturally, while CeleryExecutor can be configured (with KEDA, etc.) to scale the number of workers up and down based on queue length. The core principle is the same: don't pay for idle workers.
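To see why deferring a wait is so much cheaper than holding a worker slot, consider this standalone asyncio sketch. It is only a model of the idea: the Triggerer runs triggers on an asyncio event loop, but wait_for_external_job and its simulated polling are assumptions for illustration, not Airflow's trigger API.

```python
import asyncio


async def wait_for_external_job(job_id: int, poll_interval: float = 0.01, polls: int = 3) -> str:
    """Poll a (simulated) external job without blocking a thread."""
    for _ in range(polls):
        await asyncio.sleep(poll_interval)  # yields control to the other waits
    return f"job-{job_id}: done"


async def main() -> list:
    # One event loop multiplexes all 100 waits; no wait pins a worker.
    return await asyncio.gather(*(wait_for_external_job(i) for i in range(100)))


results = asyncio.run(main())
print(len(results), results[0])  # 100 job-0: done
```

A hundred poke-mode sensors would need a hundred worker slots for the same job; here a single process handles all of them.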
3. Reliability — How to Handle Failure
Operations is not about making things "never fail", but about making them "recover even when they fail".
- Retries. Most transient errors (network, throttling) recover automatically with retries and retry_delay (use exponential backoff via retry_exponential_backoff=True). But putting retries on a non-idempotent task is automating the incident, so idempotency from section 1 is the prerequisite.
- Timeouts. If you don't set execution_timeout, a stuck task occupies a worker slot forever. Set it to about 1.5–2x the normal execution time. (In Airflow 3, SLA has been removed and replaced by Deadline/deadline alerting, so use that for deadline-based alerts.)
- Alerts. As covered in Part 10 (Monitoring & Operations), a failure only matters if it reaches a person. Be sure to wire up on_failure_callback or an alert channel.
- Backfill strategy. Airflow 3 provides scheduler-managed backfill. When you trigger it from the UI/API, the scheduler performs it, so refilling historical data does not exceed your concurrency limits. Backfill, too, will mass-produce duplicates without the idempotency from section 1; idempotency is the foundation of everything.
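The retry and timeout settings above are usually collected in a default_args dictionary that a DAG passes to its tasks. The sketch below builds one with the standard library only. The helper names (suggest_execution_timeout, preview_backoff) and the numbers are assumptions for illustration, and the doubling schedule is a simplification: Airflow's own backoff calculation also adds jitter, so treat the preview as a rough upper-level picture, not exact wait times.

```python
from datetime import timedelta


def suggest_execution_timeout(typical_runtime: timedelta, factor: float = 2.0) -> timedelta:
    """Apply the 1.5-2x rule of thumb to a task's typical runtime."""
    return typical_runtime * factor


def preview_backoff(retry_delay: timedelta, retries: int, max_retry_delay: timedelta) -> list:
    """Simplified doubling schedule, capped at max_retry_delay (no jitter)."""
    return [min(retry_delay * (2 ** attempt), max_retry_delay) for attempt in range(retries)]


default_args = {
    "retries": 4,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=5),
    "execution_timeout": suggest_execution_timeout(timedelta(minutes=20), factor=1.5),
}

delays = preview_backoff(
    default_args["retry_delay"], default_args["retries"], default_args["max_retry_delay"]
)
print([int(d.total_seconds()) for d in delays])  # [60, 120, 240, 300]
print(default_args["execution_timeout"])  # 0:30:00
```

Capping the delay with max_retry_delay keeps a long retry chain from pushing a task past the window in which its result is still useful.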
4. Operational Readiness Checklist
Before deploying, run through the five areas below once each. If even one item is unchecked, that part is a candidate for your next incident.
Configuration
- You deliberately confirmed catchup=False (it is the default, but state it explicitly per DAG)
- You set the three concurrency layers (parallelism, max_active_tasks_per_dag, max_active_runs_per_dag) to match your load
- Tasks that need resource protection are grouped under a Pool
- Every task has an execution_timeout
- Your executor choice (Local/Celery/Kubernetes/Edge, or hybrid) matches the workload
High Availability (HA)
- You run two or more schedulers (eliminating a single point of failure)
- The API server, DAG processor, and Triggerer are each separated into independent processes and made redundant
- The metadata DB is a managed/replicated setup, and you have actually tested backup and recovery
- DAG bundle (git, etc.) synchronization is consistent across all components
Monitoring
- You collect key metrics (scheduler heartbeat, task queue backlog, failure rate)
- Logs are aggregated to a central store (logs survive even if a worker dies)
- There is a path for failure and delay alerts to reach a person
- DAG versioning lets you trace which version ran
Security
- REST API authentication (JWT) and authorization are configured (no use of the legacy experimental API)
- Secrets (connections, variables) are managed by a secrets backend rather than in plaintext
- Workers do not connect directly to the metadata DB but communicate via the Task Execution API
Testing
- DAG integrity (import, circular dependency) tests are in CI
- Core business logic has unit tests
- Deployments must pass CI/CD gates before reaching production
For detailed security and testing items, see Part 11 (Testing, CI/CD, and Security).
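As a small illustration of what the circular-dependency item in the Testing list guards against, the sketch below detects a cycle in a plain dependency mapping using the standard-library graphlib module. It is a standalone model, not an Airflow test: find_cycle and the example graphs are hypothetical, and a real CI check would load your DAG files through Airflow and fail on import errors.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional


def find_cycle(dependencies: dict) -> Optional[list]:
    """Return the task ids forming a cycle, or None if the graph is acyclic.

    dependencies maps each task id to the set of task ids it depends on.
    """
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        return list(err.args[1])  # graphlib reports the cycle as a node list
    return None


valid_graph = {"load": {"transform"}, "transform": {"extract"}, "extract": set()}
broken_graph = {"a": {"b"}, "b": {"a"}}

print(find_cycle(valid_graph))  # None
cycle = find_cycle(broken_graph)
print(cycle is not None)  # True
```

Running a check like this on every commit catches a bad dependency before the scheduler ever sees it.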
5. Map of the Entire Series
Here is the whole journey, Parts 0 through 12, summarized on one page. Come back to any topic you need at any time.
| Part | One-line essence | Link |
|---|---|---|
| 0 | Why Airflow 3, and when to use it | Overview |
| 1 | The Scheduler, API server, DAG processor, and Triggerer structure | Architecture |
| 2 | Bringing it up as a cluster with HA | Cluster Setup |
| 3 | Tuning concurrency, Pools, and the executor | Configuration & Optimization |
| 4 | The proper way to write DAGs with the Task SDK | DAG Authoring |
| 5 | Scripts, params, errors, reruns, and date semantics | Advanced DAG Techniques |
| 6 | Asset-based, data-aware scheduling | Scheduling & Assets |
| 7 | Passing data between tasks with XCom | XCom |
| 8 | External system integration & deferrable | External Integration |
| 9 | Remote schedule control via the REST API | REST API & Remote |
| 10 | Operating with metrics, logs, and alerts | Monitoring & Operations |
| 11 | Testing, CI/CD, and secret security | Testing, CI/CD, and Security |
| 12 | Production readiness review (this article) | — |
6. Where to Learn Next
This series is over, but Airflow keeps evolving. For your next steps, here is what we recommend.
- Read the official docs often. Especially the best practices and release notes in the official Apache Airflow documentation change with every version bump, so it is a safe habit to always verify config keys and behavior changes against the docs for your own version.
- Get to know the Provider package ecosystem. Most external integrations — AWS, GCP, Snowflake, dbt, and so on — are delivered as provider packages. Looking for an existing operator/hook before implementing your own dramatically reduces the code you write.
- New execution models in 3.x like EdgeExecutor and the hybrid executor are worth experimenting with directly if you have remote or edge workloads.
The final one-liner: Build on idempotency, save resources, and design assuming failure. These three principles are almost all of production Airflow.
Thank you for coming this far with us. We hope this series has helped make your data pipelines a little more solid.