Tags: airflow, monitoring, observability, prometheus, grafana

Airflow 3 Monitoring & Operations — Logs, Metrics, and Alerts in One Place

An operations-focused guide to Airflow 3: centralized logging (S3/GCS), StatsD/OpenTelemetry → Prometheus/Grafana metrics, on_failure_callback alerts, the Deadline concept that replaced SLAs, and per-component health checks.

Data Dynamics · July 4, 2026 · 10 min read

This is Part 10 of the Airflow 3 in Practice series — Monitoring & Operations. The previous part covered the REST API and remote schedule changes, and the next part continues with Testing, CI/CD & Security.

When you have only one or two pipelines, "fix it when the UI turns red" is enough. But once your DAGs grow into the hundreds and the scheduler, triggerer, and DAG processor start running on separate nodes, the story changes. The system has to tell you what's slow, what's piling up, and what failed before a human ever notices.

This article walks through how to weave together three pillars — logs, metrics, and alerts — to build operational visibility in an Airflow 3 environment, and how to handle the things that changed in Airflow 3 (the removal of SLAs, and log handling under the Task Execution API).

What you'll learn in this article

Why and how to centralize task logs into remote storage (S3/GCS)

The pipeline for exporting metrics with StatsD/OpenTelemetry and viewing them in Prometheus/Grafana

The key indicators you must watch and what they mean

on_failure_callback alerts, and the Deadline concept that replaced SLAs

Per-component health check endpoints

1. Logs: First, Gather Them in One Place

By default, Airflow task logs accumulate on the local disk of the worker that ran the task. With a single worker that's fine, but with multiple workers — or in an environment like Kubernetes where the Pod disappears when it finishes — you run into "I want to see the failed task's logs, but the worker is already gone."

The solution is simple: centralize logs into object storage. Airflow supports remote logging out of the box, so when a task finishes it uploads the local logs to a backend like S3, GCS, or Azure Blob, and the UI reads the logs from there.

# The [logging] section of airflow.cfg (or AIRFLOW__LOGGING__* environment variables)
[logging]
remote_logging = True
remote_base_log_folder = s3://my-airflow-logs/logs
remote_log_conn_id = aws_logs

The key is the Connection that remote_log_conn_id points to. If this connection holds the credentials for the bucket (an IAM role or access keys), Airflow uses them to upload and download logs. Even if a worker dies, the logs remain in the bucket and the UI queries them as usual.
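The comment at the top of the config mentions AIRFLOW__LOGGING__* environment variables. As a minimal sketch, the helper below builds those names using Airflow's AIRFLOW__{SECTION}__{KEY} convention. The function name airflow_env_var is illustrative, and the values are simply the ones from the example above.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Return the environment variable name Airflow reads for a config option."""
    # Convention: AIRFLOW__<SECTION>__<KEY>, upper-cased, with double underscores.
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# The same settings as the [logging] example, expressed as environment variables.
logging_settings = {
    "remote_logging": "True",
    "remote_base_log_folder": "s3://my-airflow-logs/logs",
    "remote_log_conn_id": "aws_logs",
}
env = {airflow_env_var("logging", k): v for k, v in logging_settings.items()}

for name, value in env.items():
    print(f"{name}={value}")
```

This form tends to be convenient in containerized deployments, where injecting environment variables is easier than shipping an airflow.cfg file.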

Logs Under the Task Execution API

In Airflow 3, workers (tasks) no longer connect directly to the metadata DB; instead they exchange state through the API server's Task Execution API (see the Architecture part). Here's what this change means for logs:

The task logs themselves are still generated on the execution side (worker/Pod/edge worker).
But because execution is decoupled from the metadata DB, and tasks may run remotely or at the edge via EdgeExecutor, having an operator SSH into a machine to read logs off its local disk is no longer a workable approach.
As a result, in Airflow 3 remote logging is no longer "nice to have" but effectively a baseline assumption. No matter where execution runs, logs are gathered into a common bucket and queried consistently through the UI/REST API provided by the API server.

The more distributed and remote your execution is, the more it pays to make "where are the logs?" a non-question; that alone is half the battle in operations. The answer should always be "the bucket."
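To make "the bucket" concrete, here is a small sketch that builds the object path where a given task attempt's log would be expected. It assumes the deployment keeps a layout like Airflow's default log_filename_template (dag_id=/run_id=/task_id=/attempt=N.log). The function name and sample IDs are illustrative, so verify the actual template in your own configuration.

```python
from typing import Optional


def expected_log_key(base_folder: str, dag_id: str, run_id: str, task_id: str,
                     try_number: int, map_index: Optional[int] = None) -> str:
    """Build the object path where a task attempt's log is expected to land.

    Assumption: the layout follows a template of the form
    dag_id=<...>/run_id=<...>/task_id=<...>/attempt=<N>.log
    """
    parts = [f"dag_id={dag_id}", f"run_id={run_id}", f"task_id={task_id}"]
    if map_index is not None and map_index >= 0:
        # Mapped task instances get an extra path segment.
        parts.append(f"map_index={map_index}")
    parts.append(f"attempt={try_number}.log")
    return base_folder.rstrip("/") + "/" + "/".join(parts)


print(expected_log_key(
    "s3://my-airflow-logs/logs", "sales_pipeline",
    "scheduled__2026-07-04T00:00:00+00:00", "extract", try_number=1,
))
```

A helper like this is handy in runbooks: given a failed task instance, anyone on call can go straight to the object without asking which worker ran it.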

2. Metrics: From StatsD/OpenTelemetry to Prometheus/Grafana

If logs are about "what happened," metrics are about "what state is the system in right now." Internally, Airflow can export numbers such as scheduler lag, queue length, and task success/failure counts as metrics. There are broadly two ways to export them:

StatsD: The long-established approach. When Airflow emits numbers over the StatsD protocol, statsd_exporter receives them and converts them into a form Prometheus can scrape.
OpenTelemetry (OTel): A standardized observability framework. When you send metrics and traces to the OTel Collector, the Collector routes them to various backends including Prometheus. If you're starting fresh, OTel is likely the better long-term choice because of its broad ecosystem.

Either way, the final picture looks similar: metrics accumulate in Prometheus and you view them in Grafana. The diagram below shows how logs and metrics flow out to external systems.

[Diagram: how logs and metrics flow from Airflow out to external systems]

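You configure this in the [metrics] section. For example, turning on StatsD looks like this (statsd_port has to match the UDP port your statsd_exporter actually listens on; the exporter's own default is 9125, so either remap it or change the value below):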

[metrics]
statsd_on = True
statsd_host = statsd-exporter
statsd_port = 8125
statsd_prefix = airflow
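To see what the exporter receives on the wire, the sketch below parses plain StatsD lines of the form name:value|type. The parser and the StatsdSample type are illustrative helpers, not part of Airflow or statsd_exporter, and the two sample lines are examples carrying the airflow prefix configured above.

```python
from typing import NamedTuple


class StatsdSample(NamedTuple):
    name: str
    value: float
    kind: str  # "c" = counter, "g" = gauge, "ms" = timer


def parse_statsd_line(line: str) -> StatsdSample:
    """Parse one plain StatsD line of the form <name>:<value>|<type>."""
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a StatsD line: {line!r}")
    return StatsdSample(name=name, value=float(fields[0]), kind=fields[1])


# Illustrative lines: a heartbeat counter and a queued-tasks gauge.
for raw in ["airflow.scheduler_heartbeat:1|c", "airflow.executor.queued_tasks:12|g"]:
    print(parse_statsd_line(raw))
```

The dotted names are what statsd_exporter maps into Prometheus metric names and labels, which is why choosing a stable statsd_prefix early saves dashboard rework later.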

If you use OpenTelemetry, enable the OTel-related options in [metrics] (e.g., otel_on, the OTel Collector endpoint) and handle the rest of the routing in the Collector configuration. For the exact option names and the scope of support, it's safest to check the official documentation for the Airflow version you're running.

The Key Indicators You Must Watch

Rather than cramming dozens of panels into a dashboard, it's far better to watch the following few closely. The numbers below are examples meant to illustrate the meaning.

Indicator	What it tells you	Warning sign (example)
Scheduler heartbeat / lag	Is the scheduler alive and running on time?	Heartbeat drops out, schedule lag steadily increasing
Queue backlog (queued tasks)	Are tasks waiting to run piling up?	Queued count keeps rising and never drops
Task failure rate	Are failures higher than usual?	Failure rate spikes for a specific DAG/time window
DAG parse time	Is the DAG processor being dragged down by heavy DAGs?	Parse time grows and new DAGs take longer to appear
Pool utilization	Is a resource-isolation pool full?	A specific pool is always at 100%, with waiting accumulating

These five let you confirm — in real time rather than after the fact — whether the three concurrency layers (parallelism, max_active_tasks_per_dag, max_active_runs_per_dag) and Pool settings covered in the Configuration & Optimization part are actually working well. If the queue keeps piling up, it's a sign that parallelism is too low or workers are insufficient; if one pool is always full, it means you need to recalculate that pool's size.
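Two of the warning signs in the table can be written down as simple checks. The sketch below is one possible reading of "queued count keeps rising and never drops" and "failure rate spikes"; the function names, the five-sample window, and the 3x factor are all assumptions to tune for your own environment. In practice the same logic would usually live in a Prometheus alerting rule rather than in Python.

```python
from typing import Sequence


def backlog_keeps_rising(queued: Sequence[int], min_samples: int = 5) -> bool:
    """True if the queued-task count never dropped in the window and ended higher."""
    if len(queued) < min_samples:
        return False  # Too few samples to call it a trend
    never_drops = all(b >= a for a, b in zip(queued, queued[1:]))
    return never_drops and queued[-1] > queued[0]


def failure_rate(failed: int, succeeded: int) -> float:
    """Share of finished task instances that failed (0.0 when nothing finished)."""
    finished = failed + succeeded
    return failed / finished if finished else 0.0


def failure_spike(current: float, baseline: float, factor: float = 3.0) -> bool:
    """True if the current failure rate is at least `factor` times the usual rate."""
    return current > 0 and current >= baseline * factor


print(backlog_keeps_rising([3, 5, 5, 8, 13]))
print(failure_spike(failure_rate(failed=2, succeeded=8), baseline=0.05))
```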

3. Alerts: The Path That Delivers Failures to People

Even when an indicator turns red, it means nothing if no one looks at it. Alerts are the last link that makes sure "when something goes wrong, it reaches a person." In Airflow 3, it's cleanest to think of alerts along two tracks.

(1) Task/DAG Callbacks — on_failure_callback

The most direct method is to attach a callback function to a task or DAG. When a task fails, Airflow calls that function, and inside it you send a Slack message or email.

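import json
import os
import urllib.request

from airflow.sdk import dag, task

# Assumption: the webhook URL is supplied via an environment variable (the name is illustrative).
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")

def send_to_slack(text):
    # Hypothetical helper: POST the message to a Slack-style incoming webhook.
    if not SLACK_WEBHOOK_URL:
        return  # Nothing configured, so skip instead of failing inside the callback
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

def notify_slack(context):
    # Airflow passes the failed task's context to the callback
    ti = context["task_instance"]
    msg = f":red_circle: Failed: {ti.dag_id}.{ti.task_id} (run {context['run_id']})"
    send_to_slack(msg)

@dag(
    schedule="@daily",
    catchup=False,                 # The default is False in Airflow 3
    default_args={"on_failure_callback": notify_slack},  # Applied to every task in the DAG
)
def sales_pipeline():
    @task
    def extract():
        ...
    extract()

sales_pipeline()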

The callback receives, via context, which task failed, in which run, and when. For Slack alerts you can either fire a webhook directly or use a community provider package, and email alerts integrate with the SMTP configuration covered in the Cluster Setup part. Note that the import path is airflow.sdk — that's the Task SDK path in Airflow 3.
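Because the callback only reads a few fields from context, its message formatting can be exercised without a running Airflow. The sketch below pulls the formatting into a hypothetical build_failure_message function and feeds it a hand-built stand-in for the context; the fake values are illustrative, and the real context object carries many more keys.

```python
from types import SimpleNamespace


def build_failure_message(context: dict) -> str:
    """Format the alert text from the fields the callback reads out of context."""
    ti = context["task_instance"]
    return f":red_circle: Failed: {ti.dag_id}.{ti.task_id} (run {context['run_id']})"


# A hand-built stand-in for the context Airflow passes to the callback.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="sales_pipeline", task_id="extract"),
    "run_id": "manual__2026-07-04",
}
print(build_failure_message(fake_context))
```

Keeping formatting separate from delivery also means a broken webhook and a broken message template show up as two different failures.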

Callbacks are strong at alerting on "a failure that already happened." On the other hand, the situation of "something that should have happened didn't" needs a different tool.

(2) SLAs Are Gone — Replaced by Deadline

Airflow 2.x had an SLA (Service Level Agreement) feature. The idea was "this task must finish within N minutes, otherwise alert on an SLA miss." However, the old SLA feature was removed in Airflow 3. Its behavior was unintuitive (especially the calculation based on logical_date) and caused a lot of confusion.

The Deadline concept (deadline alerting) takes its place. The core idea is similar to SLAs: "alert if it doesn't complete by a certain point." The difference is that the reference point and the action are defined explicitly. Where an SLA was implicit, like "N minutes from this task's logical_date," a Deadline states the intent directly, like "run this callback when this reference time plus a grace period is exceeded."

The shift from SLA to Deadline is not a mere rename but a demand to make clear "relative to when do we judge lateness."
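The "reference time plus grace period, then run a callback" idea can be modeled in a few lines of plain Python. The sketch below is conceptual only: DeadlineSketch and its fields are hypothetical names chosen for illustration and are not Airflow's Deadline API, whose exact shape should be taken from the official documentation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class DeadlineSketch:
    """Conceptual model only; this is not Airflow's Deadline API."""
    reference: datetime              # The explicit "relative to when"
    interval: timedelta              # Grace period added to the reference
    callback: Callable[[str], None]  # Action to run on a miss

    @property
    def due_at(self) -> datetime:
        return self.reference + self.interval

    def check(self, now: datetime, finished_at: Optional[datetime]) -> bool:
        """Run the callback and return True if the deadline was missed."""
        # Missed if it finished late, or is still unfinished once due_at has passed.
        missed = (finished_at or now) > self.due_at
        if missed:
            self.callback(f"deadline {self.due_at.isoformat()} missed")
        return missed


alerts = []
queued_at = datetime(2026, 7, 4, 0, 0, tzinfo=timezone.utc)
deadline = DeadlineSketch(queued_at, timedelta(minutes=30), alerts.append)
deadline.check(now=queued_at + timedelta(minutes=45), finished_at=None)
print(alerts)
```

Note that the reference here is "when the run was queued," not a logical date. Picking that reference deliberately is exactly the decision the old SLA feature left implicit.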

If you're migrating a 2.x DAG that used SLA-based alerts, you need to redesign that logic on top of Deadline. Confirm the exact API shape and options against the official documentation (airflow.apache.org) for the 3.x minor version you're using — this area is still being refined version by version.

4. Health Checks: Check Per-Component Whether It's Alive

If metrics and alerts look at "the quality of operation," health checks answer the most basic question: "is it alive?" Because Airflow 3 components are split into multiple processes — as you saw in the Architecture part — you need to check health per component.

In operations, the usual approach looks like this:

Use the API server's health endpoint as a health probe (in Airflow 3 it is served under the REST API at /api/v2/monitor/health, where Airflow 2 used /health). This response includes the metadata DB connection status along with the recent heartbeat status of components like the scheduler, triggerer, and DAG processor. If a heartbeat is too old, that component is considered to have stalled.
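If you deploy on Kubernetes, wire health checks into each component's liveness/readiness probes so that stalled Pods are automatically restarted: an HTTP probe for the API server, and for the scheduler, triggerer, and DAG processor Pods a check that inspects that component's own heartbeat (for example, the airflow jobs check CLI command).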
It's even more powerful when combined with metrics (Section 2). If you set a rule so that Alertmanager fires when the "scheduler heartbeat lag" indicator crosses a threshold, you won't have to refresh the health endpoint by hand.

The endpoint paths and response schema details can vary slightly by version, so finalize the path you probe against the documentation for your operating version.
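As a rough sketch of "if a heartbeat is too old, that component is considered to have stalled," the function below inspects an already-fetched health payload. The field names (latest_scheduler_heartbeat and so on) and the 60-second threshold are assumptions modeled on the response described above, so check them against your version's schema. A component you don't run at all would also show up as stalled here and would need to be excluded.

```python
from datetime import datetime, timedelta, timezone

# Assumed field names, modeled on the health response described above.
HEARTBEAT_FIELDS = {
    "scheduler": "latest_scheduler_heartbeat",
    "triggerer": "latest_triggerer_heartbeat",
    "dag_processor": "latest_dag_processor_heartbeat",
}


def stalled_components(health: dict, now: datetime,
                       max_age: timedelta = timedelta(seconds=60)) -> list:
    """Return names of components that look unhealthy or whose heartbeat is too old."""
    stalled = []
    if health.get("metadatabase", {}).get("status") != "healthy":
        stalled.append("metadatabase")
    for component, field in HEARTBEAT_FIELDS.items():
        info = health.get(component) or {}
        beat = info.get(field)
        too_old = beat is None or now - datetime.fromisoformat(beat) > max_age
        if info.get("status") != "healthy" or too_old:
            stalled.append(component)
    return stalled


now = datetime(2026, 7, 4, 12, 0, 0, tzinfo=timezone.utc)
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "2026-07-04T11:59:50+00:00"},
    "triggerer": {"status": "healthy", "latest_triggerer_heartbeat": "2026-07-04T11:50:00+00:00"},
    "dag_processor": {"status": "healthy", "latest_dag_processor_heartbeat": "2026-07-04T11:59:55+00:00"},
}
print(stalled_components(sample, now))
```

Checking heartbeat age yourself, rather than trusting the status string alone, guards against a payload that still says "healthy" while the timestamp has quietly stopped advancing.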

5. Wrap-up — Three Pillars Into One Operational Loop

Operational visibility isn't the sum of three tools running separately; it's a single loop.

Logs are centralized into remote storage — even if a worker dies or runs at the edge, the answer is always "the bucket."
Metrics are exported via StatsD/OTel and viewed in Prometheus/Grafana — with five at the core: heartbeat, queue backlog, failure rate, parse time, and pool utilization.
Alerts deliver failures immediately via on_failure_callback, and handle "lateness" with the Deadline that replaced SLAs.
Health checks confirm per-component liveness via the API server's health endpoint, automated by tying them to metric alerts.

Once this loop is in place, the next step is toward "preventing failures before they happen." The next part, Testing, CI/CD & Security, covers how to validate DAGs before they reach production and raise stability through deployment pipelines and access control.

Good monitoring is judged not by the number of dashboards, but by whether "an alert arrives before a person sees that something has gone wrong."