Tags: airflow, kubernetes, docker-compose, helm, cluster

Building an Airflow 3 Cluster — From Docker Compose to Kubernetes

A hands-on guide to scaling a single-node Airflow 3 install step by step: first into a multi-component Docker Compose setup, then into a production cluster on Kubernetes with the official Helm chart.

Data Dynamics · June 26, 2026 · 10 min read

Introduction

This article is Part 2 of the Airflow 3 in Practice series. In Part 1, Anatomy of the Architecture, we looked at what the Scheduler, API server, DAG processor, Triggerer, Worker, and Metadata DB each do. This part covers how to actually spread those components across multiple processes and multiple nodes and bind them into a cluster. We'll follow the path of starting with a single node, splitting the components apart with Docker Compose, and finally standing up a production cluster on Kubernetes using the official Helm chart.

By the end, you'll have the "vessel" ready for the concurrency, pool, and executor tuning we'll cover in the next part, Configuration & Optimization.

The essence of clustering fits in one sentence: push state out, and make components scalable.

Why You Need to Move Beyond a Single Node

When you first learn Airflow, the easiest setup is running every component on one machine, often inside a single container, with LocalExecutor. But the moment you move it to production, you hit three walls.

Single point of failure (SPOF) — if that one machine dies, the entire pipeline stops.
Scaling limits — as tasks grow, the CPU/memory of a single machine can't keep up.
Volatile state — if the Metadata DB lives inside the same container, a single restart can wipe out your execution history.

The path to a solution is incremental. As the flow below shows, you climb in the order "separate state → separate components → distribute across nodes → add redundancy."

[Diagram: separate state → separate components → distribute across nodes → add redundancy]

Each stage doesn't throw away the previous one — it builds on top of it. The component-separation concepts you learn with Docker Compose carry straight over to Kubernetes.

Stage 1: Separating Components with Docker Compose

The first step is to take the setup that crammed every component into one container and split it into per-component containers. Airflow 3 has clearly separated components, so this work feels natural.

Use the official docker-compose.yaml as-is, but don't forget it's meant for development and small-scale validation. Real production goes to Kubernetes.

Single-Host Topology

With CeleryExecutor, the Scheduler puts work onto a broker (Redis), and Workers pull it off and run it. It's all within one host, but each component runs as an independent container.

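[Diagram: single-host Docker Compose topology. The Scheduler puts work onto Redis, Workers pull it off, and Workers reach the Metadata DB only through the API server (dotted line).]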

Let's point out just one thing about the arrows. In Airflow 3, the Worker no longer connects directly to the Metadata DB — it communicates through the API server's Task Execution API (the dotted line). This change makes it far safer to move workers onto a different node, or even a different network. For the full background, see Part 1, Anatomy of the Architecture.

Key Excerpts from docker-compose.yaml

You can get the full file from the official Running Airflow in Docker documentation. Here we've trimmed it down to just the essentials so you can see how the components are split apart.

x-airflow-common: &airflow-common
  image: apache/airflow:3.0.0
  environment: &airflow-common-env
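    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    # Workers call the API server's Task Execution API at this URL;
    # the host must match the api-server service name used below
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: http://airflow-api-server:8080/execution/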
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    # Send remote logs to object storage (example — S3)
    AIRFLOW__LOGGING__REMOTE_LOGGING: "true"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: s3://my-airflow-logs/logs
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: aws_logs
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./config:/opt/airflow/config
  depends_on: &airflow-common-depends-on
    redis: { condition: service_healthy }
    postgres: { condition: service_healthy }
 
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
 
  redis:
    image: redis:7.2
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      retries: 5
 
  airflow-api-server:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
 
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
 
  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
 
  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
 
  airflow-worker:
    <<: *airflow-common
    command: celery worker
 
volumes:
  postgres-db-volume:

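Notice that only the command: line differs between the Airflow services, while the rest of the configuration (x-airflow-common) is shared. One image, a different command per component: this is the basic pattern of running Airflow 3 in containers. A single docker compose up -d brings up all seven services at once (the five Airflow components plus Postgres and Redis). The official file also defines an airflow-init service that runs the database migrations and creates the first user before the other services start; it is left out of this excerpt for brevity.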

Command names can change between versions (for example, api-server replaces the old webserver in 3.x). Align with the official compose file for the image tag you're using.
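To make the "one image, a different command per component" pattern concrete, here is a minimal Python sketch of what the `<<: *airflow-common` merge plus a per-service `command:` expands to. The dictionaries simply mirror the excerpt above in trimmed form; `expand_services` is a hypothetical helper written for this illustration, not part of Docker Compose or Airflow.

```python
# Shared settings, mirroring a trimmed x-airflow-common block from the excerpt.
AIRFLOW_COMMON = {
    "image": "apache/airflow:3.0.0",
    "volumes": ["./dags:/opt/airflow/dags", "./logs:/opt/airflow/logs"],
}

# Service name -> the only field that differs between the Airflow components.
COMMANDS = {
    "airflow-api-server": "api-server",
    "airflow-scheduler": "scheduler",
    "airflow-dag-processor": "dag-processor",
    "airflow-triggerer": "triggerer",
    "airflow-worker": "celery worker",
}


def expand_services(common: dict, commands: dict) -> dict:
    """Return one service definition per component: shared keys plus its own command."""
    # Like the YAML merge key, this is a shallow merge: keys set on the
    # service itself (here only "command") win over the shared ones.
    return {name: {**common, "command": cmd} for name, cmd in commands.items()}


services = expand_services(AIRFLOW_COMMON, COMMANDS)
for name, spec in services.items():
    print(f"{name}: image={spec['image']} command={spec['command']}")
```

Every service ends up with the same image and volumes, which is why upgrading Airflow in this layout is a one-line change to the shared block.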

Stage 2: A Production Cluster with Kubernetes + the Official Helm Chart

Docker Compose is tied to a single host. Real production needs to be distributed across nodes, recover on its own when something fails, and scale workers up and down with load. The standard way to get there is to deploy the official Apache Airflow Helm chart on Kubernetes.

Kubernetes Distributed Topology

What were containers in Compose become Deployments / Pods here, and the state stores (Postgres, Redis, object storage) move out to managed services outside the cluster. That's the key point — state outside the cluster, compute inside.

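[Diagram: Kubernetes distributed topology. The Airflow components run as replicated Deployments / Pods inside the cluster; Postgres, Redis, and object storage sit outside it.]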

This diagram differs from the Stage 1 diagram in two ways. (1) The state stores have moved outside the cluster, and (2) the components are made redundant with multiple replicas. These two things are the foundation of high availability and scalability.

Key Excerpts from values.yaml

You add the official chart with helm repo add apache-airflow https://airflow.apache.org and then install it. Here are just the key things you touch in values.yaml.

# Which executor runs the tasks — CeleryExecutor or KubernetesExecutor
executor: "CeleryExecutor"
 
# Always pin the image tag explicitly
images:
  airflow:
    repository: apache/airflow
    tag: "3.0.0"
 
# Scheduler redundancy (active-active HA)
scheduler:
  replicas: 2
 
# API server (UI + REST API) redundancy
apiServer:
  replicas: 2
 
# DAG processor (a standalone component in 3.x)
dagProcessor:
  enabled: true
 
# Triggerer — handles the async waits of deferrable operators
triggerer:
  replicas: 2
 
# Base number of Celery workers (autoscaling is via KEDA below)
workers:
  replicas: 2
  keda:
    enabled: true          # autoscale on the number of queued + running tasks
    minReplicaCount: 1
    maxReplicaCount: 10
 
# Use an external managed DB for metadata instead of the chart's built-in Postgres
postgresql:
  enabled: false
data:
  metadataConnection:
    user: airflow
    pass: <secret reference>    # or point data.metadataSecretName at an existing Secret instead
    host: postgres-ha.db.svc.cluster.local
    port: 5432
    db: airflow
 
# When using an external Celery broker, disable the built-in redis too
# and point the chart at the broker with data.brokerUrl (or data.brokerUrlSecretName)
redis:
  enabled: false
 
# Remote logs — to object storage (example)
config:
  logging:
    remote_logging: "True"
    remote_base_log_folder: "s3://my-airflow-logs/logs"
    remote_log_conn_id: "aws_logs"
 
# Pull DAGs from Git with a git-sync sidecar
dags:
  gitSync:
    enabled: true
    repo: https://github.com/your-org/airflow-dags.git
    branch: main
    subPath: "dags"

The key names above can vary between chart versions. In particular, for components new to Airflow 3 such as apiServer and dagProcessor, always verify the exact keys against the values.yaml of the chart version you intend to install. The official Airflow Helm Chart documentation is the source of truth.
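One way to carry out that check is to diff your override file against the defaults printed by `helm show values apache-airflow/airflow --version <chart version>`. The sketch below assumes both files have already been loaded into plain dictionaries (for example with a YAML parser such as PyYAML, which is not in the standard library). `unknown_keys` is a hypothetical helper, and `sample_defaults` is a made-up, trimmed stand-in for an older chart, not the real chart defaults.

```python
def unknown_keys(overrides: dict, defaults: dict, prefix: str = "") -> list:
    """List dotted paths that appear in `overrides` but not in the chart defaults."""
    missing = []
    for key, value in overrides.items():
        path = f"{prefix}{key}"
        if key not in defaults:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(defaults[key], dict) and defaults[key]:
            # Recurse only into non-empty default maps; an empty map in the
            # defaults is treated as free-form and accepts any keys.
            missing.extend(unknown_keys(value, defaults[key], prefix=f"{path}."))
    return missing


# Made-up stand-in for the defaults of a hypothetical older chart version.
sample_defaults = {
    "executor": "CeleryExecutor",
    "scheduler": {"replicas": 1},
    "workers": {
        "replicas": 1,
        "keda": {"enabled": False, "minReplicaCount": 0, "maxReplicaCount": 10},
    },
}

# Overrides with one key this stand-in chart lacks and one misspelled key.
overrides = {
    "scheduler": {"replicas": 2},
    "apiServer": {"replicas": 2},
    "workers": {"keda": {"enabled": True, "maxReplicas": 10}},
}

print(unknown_keys(overrides, sample_defaults))
# ['apiServer', 'workers.keda.maxReplicas']
```

An unknown key is worth catching early because it may be accepted without complaint and then quietly have no effect.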

Should You Use KubernetesExecutor or CeleryExecutor?

Item	CeleryExecutor	KubernetesExecutor
Worker form	Always-on worker pool	A Pod is created and torn down per task
Broker required	Required (Redis/RabbitMQ)	Not required
Startup latency	Low (workers are already running)	Higher (a Pod must be scheduled and started)
Resource isolation	Per worker	Per task (strong)
Best for	Many short, frequent tasks	Large per-task resource variance

Since Airflow 2.10 you can configure multiple executors at once and pick one per task or per DAG, so in Airflow 3 you can route short tasks to Celery and heavy batches to Kubernetes. To start simple, we recommend the CeleryExecutor + KEDA combination.
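The routing rule behind a multi-executor setup is simple: the executor setting takes a comma-separated list, the first entry is the default, and a task that names one of the configured executors runs there instead. The Python sketch below models only that rule. `parse_executors` and `resolve_executor` are hypothetical helpers for illustration, not Airflow's own loader code.

```python
from typing import List, Optional


def parse_executors(setting: str) -> List[str]:
    """Split a comma-separated executor setting; the first entry is the default."""
    names = [name.strip() for name in setting.split(",") if name.strip()]
    if not names:
        raise ValueError("at least one executor must be configured")
    return names


def resolve_executor(configured: List[str], task_executor: Optional[str] = None) -> str:
    """Return the executor a task runs on: its own choice if set, else the default."""
    if task_executor is None:
        return configured[0]
    if task_executor not in configured:
        raise ValueError(f"{task_executor!r} is not among the configured executors")
    return task_executor


configured = parse_executors("CeleryExecutor,KubernetesExecutor")
print(resolve_executor(configured))                        # CeleryExecutor
print(resolve_executor(configured, "KubernetesExecutor"))  # KubernetesExecutor
```

With this layout, short frequent tasks need no annotation at all, and only the heavy batch tasks opt in to Kubernetes.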

Separating the State Stores: The Real Heart of a Cluster

No matter how much you replicate your components, it's meaningless if the state store is a single point of failure. You take three kinds of state outside the cluster and make each one robust.

Metadata DB (Postgres HA): all execution history, DAG versions, and connection info live here. Use a primary + replica setup or a managed service (AWS RDS, Cloud SQL, etc.), and be sure to enable automatic backups. If this is lost, Airflow's entire memory disappears.
Broker (Redis/RabbitMQ for Celery): this is the queue through which the Scheduler hands work to the workers. It's only needed with CeleryExecutor and is unnecessary with KubernetesExecutor. Secure availability with managed Redis or clustered RabbitMQ.
Remote logs (S3/GCS): Pods can disappear at any time, so task logs that exist only on a worker's local disk are lost along with the Pod. Shipping them to object storage is what preserves them, and the remote_logging settings in the excerpts above play this role.

Remember: treat Pods and containers like cattle, and treat the state stores like pets. Compute can die and come back at any time, but state can't.
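A quick way to audit this is to look at where each state-store URL actually points. The sketch below flags URLs that resolve to the local container, either a file path with no host or a loopback address. It assumes the URLs follow the formats used in the excerpts above; `is_local_state` is a hypothetical helper, and a host name by itself says nothing about whether the service behind it is highly available.

```python
from urllib.parse import urlsplit

# Hosts that mean "inside this same container or machine".
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local_state(url: str) -> bool:
    """True if the URL has no host (a local file) or points at loopback."""
    host = urlsplit(url).hostname
    return host is None or host in LOCAL_HOSTS


state_stores = {
    "metadata DB": "postgresql+psycopg2://airflow:airflow@postgres/airflow",
    "broker": "redis://:@redis:6379/0",
    "remote logs": "s3://my-airflow-logs/logs",
    "bad example": "sqlite:////opt/airflow/airflow.db",
}

for name, url in state_stores.items():
    verdict = "LOCAL (lost on restart)" if is_local_state(url) else "external"
    print(f"{name}: {verdict}")
```

Only the SQLite entry is flagged here, since it is a file inside the container rather than a separate service.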

High Availability: Redundancy and Autoscaling

The last stage is "don't stop even if one machine dies, and scale up and down with load."

Scheduler redundancy (active-active): Airflow has supported running several Scheduler instances at once since 2.0, and Airflow 3 keeps that design. The instances coordinate through row-level locks in the metadata DB (SELECT ... FOR UPDATE SKIP LOCKED), so no task is picked up twice. With scheduler.replicas: 2 or more, if one instance dies, another keeps scheduling. No separate setup like leader election is needed.
API server / Triggerer redundancy: UI and REST API traffic, as well as the async waits of deferrable operators, are spread across replicas, so losing one replica does not interrupt them.
Worker autoscaling (KEDA): with CeleryExecutor, KEDA scales worker Pods up when work piles up and back down when things are idle. In the official chart, the default trigger queries the metadata DB for queued and running task instances rather than reading the broker queue directly. With KubernetesExecutor, a Pod is created per task, so you delegate capacity to the cluster's node autoscaler (such as Cluster Autoscaler).
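To get a feel for the worker autoscaling arithmetic, here is a rough Python sketch. It assumes the scaling signal is the number of queued plus running tasks divided by the per-worker concurrency, which is our understanding of the official chart's default KEDA trigger, and that the result is clamped to the min and max replica counts from values.yaml. `desired_workers` is a hypothetical helper, and real KEDA behavior also involves polling intervals and cooldown periods that are not modeled here.

```python
import math


def desired_workers(queued: int, running: int, worker_concurrency: int = 16,
                    min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the worker count: ceil(active tasks / concurrency), clamped."""
    if worker_concurrency <= 0:
        raise ValueError("worker_concurrency must be positive")
    needed = math.ceil((queued + running) / worker_concurrency)
    return max(min_replicas, min(max_replicas, needed))


print(desired_workers(queued=0, running=0))    # 1  (idle: falls back to the minimum)
print(desired_workers(queued=40, running=8))   # 3  (48 tasks / 16 per worker)
print(desired_workers(queued=500, running=0))  # 10 (capped at the maximum)
```

The default of 16 mirrors Celery's worker_concurrency default in Airflow. If you change that setting, the number of Pods needed for the same backlog changes with it.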

The premise that makes all this redundancy work safely is that state has already been pushed out. That's why we called separating state the "real heart" of clustering.

Wrapping Up

In this part, we started from a single node, separated the components with Docker Compose, stood up a distributed cluster with Kubernetes + the official Helm chart, and then built up through state-store separation all the way to high availability. The core principle is exactly what we promised at the start — state out, compute scalable.

The vessel that is the cluster is ready. In the next part, Configuration & Optimization, we'll cover how to draw out real performance on top of it through the three layers of concurrency (parallelism, max_active_tasks_per_dag, and max_active_runs_per_dag), along with Pools and executor tuning.

When you first stand up a cluster, always start small. Get component separation into your muscle memory with Compose, and when you move on to Kubernetes, the same concepts carry straight over.