AgentOps: Agent Observability and Tracing — What Was It Thinking When It Called That Tool?

Operations trilogy, Part 1. The observability challenges unique to agents, distinct from LLM monitoring: step- and tool-level tracing, token/cost budgets, failure reproduction, and the practical patterns for instrumenting agents with Prometheus and Argus self-telemetry.

Data Dynamics · June 20, 2026 · 9 min read

When an agent misbehaves in production, the first wall you hit is "I have no idea why it did that." Yesterday it produced the right SQL for a question, today it picked the wrong table — and the logs only have the final output. If you can't see what reasoning happened in between, which tools it called and why, or how many tokens it burned, debugging becomes fortune-telling.

This is the starting point of AgentOps — agent operations. This article is Part 1 of the Operations trilogy, the first of the three bridges you must lay across the river to put a data engineering agent into production: observability.

What you'll learn

How agent observability differs from ordinary LLM monitoring

How to unfold an agent's reasoning with traces and spans

How to track tokens, cost, and latency at the step level

How to reproduce (replay) a failed run

Practical patterns for instrumenting agents with Prometheus and self-telemetry

Operations trilogy — Part 1 AgentOps observability (this article) · Part 2 Agent Evaluation · Part 3 Agent Security

1. What's Different From LLM Monitoring

LLM monitoring usually looks at one call — input prompt, output tokens, latency, cost. For single-shot inference, that's enough. But an agent weaves many calls and tool executions into a loop for a single goal. The unit of observation becomes not "a call" but a trajectory.

So the questions agent observability must answer differ in kind from LLM monitoring.

Dimension	LLM monitoring	Agent observability
Unit of observation	1 call	the whole trajectory (N steps)
Cost	tokens per call	cumulative tokens per trajectory + tool cost
Failure modes	4xx/5xx, timeout	wrong tool choice, loop, hallucination
Debugging	view the prompt	replay per-step reasoning and tool I/O
Key metrics	latency, error rate	task success rate, step count, retry rate

If LLM Inference Optimization is "make one call fast and cheap," AgentOps is "make the whole trajectory of many woven calls visible."

2. Traces and Spans — Unfolding the Reasoning

The basic data structure of agent observability is the trace and span, borrowed from distributed tracing. One goal handled = one trace; each step within it (reason, tool call, observe) = one span. Nest the spans and "which tool call came from which piece of reasoning" unfolds as a tree.

At minimum, leave this much on each span so it's useful later.

Input/output — prompt and response on reasoning spans, args and return value on tool spans (mask sensitive data)
Tokens/cost — input/output tokens, model, estimated cost
Latency — span start/end
Status — success/failure/retry, error message
Correlation ID — thread all spans together by trace_id
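
The span fields listed above can be sketched with the standard library alone. This is a minimal illustration, not a tracing SDK: the Span class, its field names, and render_tree are hypothetical names chosen for this example, and the token counts and model name are made-up values.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One step of a trajectory: reasoning, a tool call, or an observation."""
    trace_id: str                  # correlation ID shared by every span in the trace
    name: str
    kind: str                      # "reason", "tool", or "observe"
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    status: str = "ok"             # "ok", "error", or "retry"
    attrs: dict[str, Any] = field(default_factory=dict)

    def finish(self, status: str = "ok", **attrs: Any) -> None:
        """Close the span and attach I/O, token, and cost attributes."""
        self.end = time.time()
        self.status = status
        self.attrs.update(attrs)


def render_tree(spans: list[Span]) -> list[str]:
    """Return indented lines showing which span was spawned by which."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            lines.append("  " * depth + f"{s.kind}:{s.name} [{s.status}]")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Example values below are illustrative only
trace_id = uuid.uuid4().hex
root = Span(trace_id, "answer_question", "reason")
tool = Span(trace_id, "run_sql", "tool", parent_id=root.span_id)
tool.finish(status="ok", rows=42)
root.finish(status="ok", model="example-model", input_tokens=812, output_tokens=96)
tree = render_tree([root, tool])
```

Here tree holds two lines, "reason:answer_question [ok]" followed by an indented "tool:run_sql [ok]", which is the reasoning-to-tool-call nesting in its smallest form.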

Export this data in a standard format (OpenTelemetry's GenAI semantic conventions are becoming the de facto standard) and you can lay it right on top of your existing observability stack.

3. Tokens, Cost, Latency — Why You Must View Them Per Step

Agent cost explodes per trajectory. The user asked one question, but internally 5 LLM calls + 3 Trino scans may have run. Look only at per-call cost and you fall into the illusion that it "looks cheap."

So the cost metrics to track are:

Tokens/cost per trajectory — "how much per question on average" is the real unit cost
Step-count distribution — a rising average step count signals inefficiency/loops
Retry rate — frequent self-correction = low first-attempt quality
Cost per tool — which tool (especially full-scan queries) eats the money
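
As a rough sketch of how these four numbers fall out of one trajectory's step records: the Step type, the summarize helper, and all token counts and dollar amounts below are assumptions made up for illustration, not real prices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Step:
    kind: str             # "llm" or "tool"
    name: str             # model name or tool name
    tokens: int = 0       # input + output tokens (LLM steps only)
    cost_usd: float = 0.0
    is_retry: bool = False


def summarize(steps: list[Step]) -> dict:
    """Roll one trajectory's steps up into per-trajectory cost metrics."""
    cost_per_tool: dict[str, float] = defaultdict(float)
    for s in steps:
        if s.kind == "tool":
            cost_per_tool[s.name] += s.cost_usd
    return {
        "total_tokens": sum(s.tokens for s in steps),
        "total_cost_usd": round(sum(s.cost_usd for s in steps), 6),
        "step_count": len(steps),
        "retry_rate": sum(s.is_retry for s in steps) / len(steps) if steps else 0.0,
        "cost_per_tool": dict(cost_per_tool),
    }


# One user question that fanned out into four internal steps
steps = [
    Step("llm", "example-model", tokens=900, cost_usd=0.009),
    Step("tool", "trino_scan", cost_usd=0.05),
    Step("llm", "example-model", tokens=700, cost_usd=0.007, is_retry=True),
    Step("tool", "trino_scan", cost_usd=0.05),
]
summary = summarize(steps)
```

In this made-up trajectory the two scans account for most of the spend, which is exactly what per-call token cost alone would hide.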

These map straight onto Prometheus metrics. The design principles of counters, histograms, and gauges are in the Prometheus Metrics Guide, and collector setup in the Prometheus Installation Guide. For example:

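# Instrument agent trajectories as Prometheus metrics
from prometheus_client import Counter, Histogram

# Cumulative token usage, split by model and agent phase (e.g. reason, tool)
AGENT_TOKENS = Counter("agent_tokens_total", "cumulative tokens", ["model", "phase"])
# Steps per trajectory; a distribution shifting right hints at loops
AGENT_STEPS = Histogram("agent_steps", "steps per trajectory", buckets=[1, 2, 3, 5, 8, 13])
# Latency of each tool call in seconds, labeled by tool name
TOOL_LATENCY = Histogram("agent_tool_seconds", "tool latency", ["tool"])

def record_step(model, phase, in_tok, out_tok):
    # Count input and output tokens together for this step
    AGENT_TOKENS.labels(model, phase).inc(in_tok + out_tok)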
For rules that fire alerts when a budget threshold is crossed, follow the Prometheus Alert Rules Guide. An alert like "average tokens per trajectory doubled over the last hour" catches loop runaways early.
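
In a Prometheus setup this comparison would normally live in an alert rule, so the Python below only illustrates the logic behind the "doubled" alert. The function name, the factor parameter, and the sample token counts are assumptions for this sketch.

```python
from statistics import mean


def tokens_doubled(baseline: list[int], last_hour: list[int], factor: float = 2.0) -> bool:
    """True when the last hour's average tokens per trajectory is at least
    `factor` times the baseline average."""
    if not baseline or not last_hour:
        return False  # not enough data to compare
    return mean(last_hour) >= factor * mean(baseline)


# Tokens per trajectory; values are illustrative
baseline = [1200, 1500, 1300]
runaway = tokens_doubled(baseline, [2900, 3100, 2600])
normal = tokens_doubled(baseline, [1400, 1600])
```

Comparing against a baseline rather than a fixed number keeps the check meaningful as prompts and models change, at the cost of needing a trustworthy baseline window.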

4. Failure Reproduction — Not Reading Logs, But Re-running

The holy grail of agent debugging is replay. Store every input of a failed trajectory — prompts, tool return values, model/parameters — and you can replay that moment exactly to pinpoint "where it lost its way."

For replay to be possible, there are things to handle at design time.

Aim for determinism: where possible, use temperature=0 and a fixed seed. Even when runs are not fully deterministic, the preserved inputs are still valuable for analysis
Store tool return values: external state (query results, etc.) changes over time, so keep each value exactly as it was returned at the time
Version tagging: stamp prompt/model/tool-schema versions onto the trajectory (to confirm you are replaying against the same code)
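
The three points above can be sketched as a recorder that captures tool results during the live run and a replayer that serves them back later. Recorder, Replayer, count_rows, and the version strings are hypothetical names for this example, and LLM calls are left out to keep it short.

```python
import json
from typing import Any, Callable, Optional


class Recorder:
    """Wraps tool calls and stores their arguments and return values."""

    def __init__(self, versions: dict[str, str]):
        # Version tags travel with the trajectory
        self.record = {"versions": versions, "tool_calls": []}

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        result = fn(**kwargs)
        self.record["tool_calls"].append({"tool": name, "args": kwargs, "result": result})
        return result


class Replayer:
    """Serves stored return values instead of touching external systems."""

    def __init__(self, record: dict):
        self.calls = list(record["tool_calls"])
        self.cursor = 0

    def call(self, name: str, fn: Optional[Callable[..., Any]] = None, **kwargs: Any) -> Any:
        expected = self.calls[self.cursor]
        if (expected["tool"], expected["args"]) != (name, kwargs):
            raise RuntimeError(f"trajectory diverged at tool call {self.cursor}: {name}")
        self.cursor += 1
        return expected["result"]


live_rows = {"orders": 10}  # stand-in for external state


def count_rows(table: str) -> int:
    return live_rows[table]


rec = Recorder({"prompt": "v3", "model": "example-model", "tool_schema": "v1"})
first = rec.call("count_rows", count_rows, table="orders")
stored = json.dumps(rec.record)  # persist alongside the trace

live_rows["orders"] = 99  # external state drifts after the failure
replayed = Replayer(json.loads(stored)).call("count_rows", table="orders")
```

The replayed call returns the value from the original run even though the live table has changed, and a mismatch in tool name or arguments surfaces as an explicit divergence rather than a silent difference.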

This replay dataset isn't thrown away. It becomes the golden dataset for regression testing — bundle past failed trajectories as fixed cases, and every time you fix a prompt you can automatically verify "did an old bug come back." This link is the very starting point of the next Part 2, Agent Evaluation.
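
A minimal sketch of such a regression run, assuming each golden case stores the question plus a property the fixed agent must satisfy. The case ids, the must_use_table field, and stub_agent are invented for illustration, and the substring check is deliberately naive.

```python
from typing import Callable

# Hypothetical golden cases collected from past failed trajectories
GOLDEN_CASES = [
    {"id": "wrong-table-001", "question": "daily revenue", "must_use_table": "orders"},
    {"id": "wrong-table-002", "question": "signups per week", "must_use_table": "users"},
]


def run_regression(agent: Callable[[str], str], cases: list[dict]) -> list[str]:
    """Return ids of cases whose generated SQL does not mention the expected table."""
    failures = []
    for case in cases:
        sql = agent(case["question"])
        # Naive substring check; a real check would parse the SQL
        if case["must_use_table"] not in sql.lower():
            failures.append(case["id"])
    return failures


def stub_agent(question: str) -> str:
    """Stand-in for the real agent, with one answer that has regressed."""
    answers = {
        "daily revenue": "SELECT day, SUM(amount) FROM orders GROUP BY day",
        "signups per week": "SELECT week, COUNT(*) FROM events GROUP BY week",
    }
    return answers[question]


failed = run_regression(stub_agent, GOLDEN_CASES)
```

Checking a property of the output instead of the exact SQL text leaves room for the many equivalent queries an agent may legitimately produce.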

5. The Agent Instruments Itself — Self-Telemetry

Instead of bolting observability on from the outside, you can have the agent register itself as a thing to be instrumented. We already applied this pattern to the Argus catalog agent — at boot, the agent registers itself in the catalog registry and pushes its call metrics as telemetry.

The beauty of this design is that observability becomes a first-class responsibility of the agent, not a byproduct of operations. The concrete implementation (registry self-registration, call-metric push, and using the Pushgateway in Prometheus's pull-based scrape environment) is laid out in The AI That Governs the Catalog (self-telemetry article) and the Prometheus Pushgateway Guide.
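
The shape of the pattern, register at boot and then push call metrics, can be sketched in memory. This is not the Argus implementation: CatalogRegistry, SelfInstrumentedAgent, and their methods are hypothetical stand-ins, and a real setup would push to an actual registry or gateway instead of a Python list.

```python
import time
from collections import Counter


class CatalogRegistry:
    """Hypothetical in-memory stand-in for a catalog registry."""

    def __init__(self) -> None:
        self.agents: dict[str, dict] = {}
        self.telemetry: list[dict] = []

    def register(self, agent_id: str, metadata: dict) -> None:
        self.agents[agent_id] = metadata

    def push(self, agent_id: str, metrics: dict) -> None:
        self.telemetry.append({"agent_id": agent_id, "ts": time.time(), **metrics})


class SelfInstrumentedAgent:
    def __init__(self, agent_id: str, registry: CatalogRegistry) -> None:
        self.agent_id = agent_id
        self.registry = registry
        self.calls: Counter = Counter()
        # Register at boot so the agent shows up as an observable entity
        registry.register(agent_id, {"kind": "agent", "version": "0.1.0"})

    def call_tool(self, tool: str) -> None:
        self.calls[tool] += 1  # real tool execution omitted

    def flush(self) -> None:
        """Push accumulated call counts, then reset the local counters."""
        self.registry.push(self.agent_id, {"tool_calls": dict(self.calls)})
        self.calls.clear()


registry = CatalogRegistry()
agent = SelfInstrumentedAgent("catalog-agent", registry)
for tool in ["search", "describe", "search"]:
    agent.call_tool(tool)
agent.flush()
```

Because registration happens in the constructor, an agent that boots is visible in the registry before it handles its first goal.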

Go one step further and you can have the agent's telemetry itself analyzed by another agent. The approach of reading anomalies out of logs/metrics with an LLM is covered in LLM-Based Log Analysis · AIOps.

6. A Checklist for an Observable Agent

Right before production, this much should be in place to avoid "I have no idea why it did that."

Every handled goal gets a trace_id, and every step is recorded as a span
Each span has input, output, tokens, cost, latency, status (with sensitive data masked)
Tokens, cost, and step count are aggregated as metrics per trajectory
Alerts fire when budget/retry-rate thresholds are crossed
Failed trajectories can be reproduced, including inputs and tool return values
Replay cases accumulate as a regression test dataset
Every tool call lands in an audit log (for security/governance)
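
For the last item, and for the masking mentioned in the second, one possible shape is a decorator around each tool function. AUDIT_LOG, mask, audited, and run_query are hypothetical names, the key pattern is only an example of what might count as sensitive, and binding trace_id at decoration time is a simplification for this sketch.

```python
import functools
import json
import re
from typing import Any, Callable

AUDIT_LOG: list[str] = []  # stand-in for an append-only log sink
SENSITIVE_KEYS = re.compile(r"password|token|secret|api_key", re.IGNORECASE)


def mask(args: dict[str, Any]) -> dict[str, Any]:
    """Replace values whose key looks sensitive."""
    return {k: "***" if SENSITIVE_KEYS.search(k) else v for k, v in args.items()}


def audited(trace_id: str) -> Callable:
    """Decorator that writes one JSON audit entry per tool call."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            entry = {"trace_id": trace_id, "tool": fn.__name__, "args": mask(kwargs)}
            try:
                result = fn(**kwargs)
                entry["status"] = "ok"
                return result
            except Exception as exc:
                entry["status"] = "error"
                entry["error"] = str(exc)
                raise
            finally:
                # Runs on success and failure, so every call is logged
                AUDIT_LOG.append(json.dumps(entry))
        return wrapper
    return decorator


@audited(trace_id="trace-123")
def run_query(sql: str, api_key: str) -> int:
    return len(sql)  # placeholder for real query execution


run_query(sql="SELECT 1", api_key="s3cr3t")
logged = json.loads(AUDIT_LOG[0])
```

The entry keeps the SQL text but stores "***" in place of the API key, so the audit trail stays useful without leaking credentials.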

Closing — You Can't Fix What You Can't See

The first truth of agent operations is simple. You can't fix what you can't see. If LLM monitoring views a point, AgentOps must view a trajectory; unfold that trajectory into traces and spans, track tokens and cost at the step level, and only when you can reproduce a failure exactly are you ready to trust the agent.

But even when observability shows "what happened," it still doesn't answer "was that a good result?" Data engineering agents often have more than one right answer (many different SQL queries can produce the same result). So Part 2, Agent Evaluation, covers how to measure agent quality when there isn't a single right answer.

In one sentence: The unit of agent observability is the trajectory, not the call; only with tracing, cost, and replay in place does debugging turn from fortune-telling into engineering.