why ai agents need observability
your agent just failed. it was a 10-step workflow: pull user data from postgres, enrich it with a third-party api, classify intent via gpt-4, select a tool, execute the tool, validate the output, format a response, log the result, update the database, send a webhook. it failed at step 9, the database update. a constraint violation. you stare at the error, but the real question isn't why the insert failed. the real question is: what happened at step 3 that produced a classification that led step 4 to pick the wrong tool, which generated an output at step 6 that looked valid but contained a malformed field that didn't surface until step 9 tried to write it?
you don't know. you can't know. your logs say "step 9: constraint violation." helpful.
so you do what every developer building with agents does right now. you re-run the entire chain. all 10 steps. the same api calls, the same llm invocations, the same compute bill. except this time you add print statements. maybe you dump the intermediate state to a json file. maybe you squint at a datadog trace that shows you 10 spans with latency numbers and tells you absolutely nothing about why the agent chose what it chose.
this is the state of agent debugging in 2025. it's embarrassing.
your apm tool doesn't understand agents
let me be specific about why traditional monitoring fails here. datadog, new relic, sentry — these are excellent tools for request-response architectures. a user hits an endpoint, the endpoint does work, the endpoint returns. you get latency percentiles, error rates, stack traces. you can trace a request through microservices. great.
an agent workflow is not a request-response. it's a decision tree that unfolds over time, where each node depends on the output of previous nodes, where the "logic" isn't in your code but in an llm's token predictions, and where failures cascade silently because the llm doesn't throw exceptions — it just produces subtly wrong output that passes every type check you've written.
consider what you actually need to debug the scenario above. you need to see the exact prompt that went into step 3, the exact response that came back, and the specific tokens that caused the downstream classification. you need to see what context was in the window at that point — was there pollution from an earlier step that biased the model? you need to compare this run against a previous successful run and see where the decision paths diverged. you need the state snapshot at every step boundary so you can reason about what the agent "knew" at each point.
none of that is a latency metric. none of it is a stack trace. none of it fits into the mental model that apm tools were built around.
apm tools answer "what happened and how long did it take." agent observability needs to answer "what did the agent decide, why, and what would have happened if it decided differently." these are fundamentally different questions. bolting agent tracing onto an apm dashboard is like using a speedometer to debug a chess game.
i've talked to teams running agents in production — customer support bots, data pipeline orchestrators, coding assistants with tool use. the pattern is the same everywhere. they have monitoring that tells them something went wrong. they have no tooling that helps them understand why the agent behaved the way it did, or that lets them reproduce and fix the issue without re-running everything from scratch.
the failure modes are specific and worth naming. reasoning drift: the model's outputs gradually shift over a multi-step chain because context accumulates and earlier outputs bias later ones. tool selection errors: the agent picks the wrong tool at step 4, but the output looks plausible enough that nothing flags it until much later. context pollution: irrelevant information from an earlier step leaks into the prompt window and degrades performance on subsequent steps. silent degradation: the agent produces a "correct" output that's subtly different from what it produced yesterday on identical inputs, and you have no way to detect or explain the drift.
traditional monitoring catches none of these. they're not errors. they're not slow. they're just wrong, in ways that require understanding the agent's reasoning to diagnose.
and then there's the cost problem.
every time you re-run a 20-step agent chain to debug a failure at step 17, you're paying for 20 llm calls, 20 tool invocations, 20 rounds of compute. if you're using gpt-4 with large context windows, a single 20-step chain might cost $0.40-$2.00 depending on your prompts. debug it 15 times during development and you've burned $6-$30 on a single bug. multiply by the number of bugs you hit in a week and the number of developers on your team. it adds up fast. i've seen teams spending $2,000-$5,000/month purely on re-running agent chains during development. not on production traffic. on debugging.
the economics argument alone should be enough to convince you. but it's not just about money. it's about iteration speed. if debugging an agent issue takes 45 minutes of re-running chains and staring at logs, you get maybe 10-12 debug cycles in a day. if you can fork from the exact step where things went wrong, modify the input, and replay only the remaining steps in under a minute, you get hundreds of iterations. that's not an incremental improvement. that's a different way of working.
what good agent observability actually looks like
i think the right mental model is git, but for agent behavior. git lets you see every change, diff any two commits, branch from any point, and understand exactly how your code evolved from state a to state b. agent observability should give you the same thing for executions.
step-by-step tracing is the foundation. not just "step 4 took 1.2 seconds" but the full picture: what went into step 4 (the complete prompt, the context window, the tool parameters), what came out (the raw llm response, the parsed output, the tool results), and what it cost (tokens in, tokens out, dollars, wall-clock time). every step, captured automatically, without requiring the developer to instrument their agent code.
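a minimal sketch of what one captured step record might hold — the field names here are illustrative, not any particular sdk's schema:

```python
# illustrative shape of a single captured step. every field below maps to
# something named in the text: full inputs, full outputs, and cost.
from dataclasses import dataclass

@dataclass
class StepTrace:
    step_index: int
    name: str            # e.g. "classify_intent" (hypothetical step name)
    inputs: dict         # complete prompt, context window, tool parameters
    outputs: dict        # raw llm response, parsed output, tool results
    tokens_in: int
    tokens_out: int
    cost_usd: float
    wall_clock_ms: float
```

the point of capturing this per step, automatically, is that any later question ("what did step 3 actually see?") becomes a lookup instead of a re-run.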
but tracing alone isn't enough. the killer feature is fork and replay.
imagine you can click on step 4 in a visual timeline, see exactly what the agent received and produced, modify the input, and hit "replay from here." steps 1 through 4 aren't re-executed — their state is already captured. the system restores the exact state at the step 4 boundary and runs only the remaining steps with your modified input. if the result is different, you can diff the two runs side by side: original vs. replayed, step by step, seeing exactly where the outputs diverge.
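mechanically, fork and replay is simple once state capture exists. a sketch under loose assumptions — each step is a function from state to state, and `captured_states[i]` is the state entering step i; none of this is timemachine's actual api:

```python
# fork-and-replay loop: restore the captured state at a step boundary,
# apply the developer's modification, then re-execute only the remaining
# steps. the state shape and step signature are hypothetical.
import copy

def replay_from(captured_states, steps, fork_index, modified_input):
    """captured_states[i] is the state *entering* step i (0-based);
    steps[i] is a function state -> state."""
    state = copy.deepcopy(captured_states[fork_index])  # exact restore, no re-run
    state.update(modified_input)                        # the developer's edit
    for step in steps[fork_index:]:                     # only the tail runs
        state = step(state)
    return state
```

the expensive prefix (steps 0 through fork_index - 1) never executes; only the suffix pays for llm calls and tools.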
that's not a hypothetical. that's what we're building with timemachine sdk.
the approach is an sdk wrapper that sits around your agent's execution. you don't modify your agent logic. the wrapper intercepts each step boundary and captures the full state: inputs, outputs, llm prompts and responses, tool calls and results, timing, cost, the complete context window. it stores everything in postgresql with jsonb columns, which means you can query across executions with sql. the web dashboard renders the timeline, lets you click into any step, and provides the fork & replay interface.
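one way such a wrapper could work, shown as a toy decorator — the decorator name, the in-memory `TRACE` list, and the step function are all hypothetical stand-ins (a real system would write each record to postgres as jsonb, not append to a list):

```python
# a decorator that records each step's inputs, outputs, and timing
# without touching the agent logic itself. everything here is a sketch,
# not timemachine's api.
import functools
import time

TRACE = []  # stand-in for a jsonb-backed store

def traced_step(name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "wall_clock_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator

@traced_step("classify_intent")
def classify_intent(text):
    # stand-in for an llm call
    return {"intent": "billing"}
```

because the records are plain json, storing them in a jsonb column means questions like "show me every execution where classify_intent returned billing" become ordinary sql.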
the overhead is under 5% on agent runtime. the state capture is exact — when you fork from step 4, the restored state is byte-identical to what the original execution had at that point. this matters because agent behavior is sensitive to input. if your replay starts from an approximation of the original state, your results are meaningless.
there's another thing that becomes possible once you have full state capture: data drift detection. run the same agent with the same inputs on monday and friday. did you get the same outputs? if not, where did the paths diverge? was it the llm producing different completions (temperature, model updates)? was it a tool returning different data? was it context pollution from a different conversation history? without step-level state capture, you can't even ask these questions. with it, you can diff the two runs and see the exact step where they diverged, the exact tokens that differed, and the downstream cascade.
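the core of that diff is small. a sketch assuming each run is captured as an ordered list of json-serializable step outputs:

```python
# step-level drift detection: walk two captured runs in parallel and
# report the first step whose output differs. run shapes are assumed,
# not any particular sdk's format.
def first_divergence(run_a, run_b):
    for i, (a, b) in enumerate(zip(run_a, run_b)):
        if a != b:
            return i, a, b  # step index plus both outputs, ready to diff
    if len(run_a) != len(run_b):
        # one run ended early: divergence is at the shorter run's end
        return min(len(run_a), len(run_b)), None, None
    return None  # identical runs
```

once you know the step index, the captured records for that step tell you whether the divergence came from the llm, a tool, or the context window.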
this is particularly important for teams trying to run agents reliably in production. your agent works in the demo. it works in staging. it fails on 3% of production traffic in ways you can't reproduce because you don't have the exact state that led to the failure. if you're capturing state at every step boundary, you can pull up any failed execution, inspect the full trace, fork from the relevant step, and reproduce the issue deterministically. that's the difference between "we think we fixed it" and "we reproduced the exact failure, identified the cause, and verified the fix on the captured state."
let me make the economics concrete. say you have a 20-step agent chain. each step involves an llm call averaging $0.08 (a mix of gpt-4 and gpt-4o-mini, moderate context windows). a full run costs $1.60. during development, your team re-runs the full chain an average of 12 times per bug, and you hit roughly 8 bugs a day across the team. that's $153.60 per day — over a 20-workday month, $3,072 — just on debugging re-runs.
now suppose you can fork and replay from the step that matters. on average, the bug surfaces around step 14, and you replay from step 13. that's 12 steps you never re-execute — a 60% compute saving per debug run. your monthly debugging cost drops from $3,072 to $1,229. that's $1,843 back in your pocket, every month, from a single workflow improvement. and that's before you account for the developer time saved — the hours of staring at logs replaced by clicking through a visual timeline.
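the arithmetic, spelled out. the inputs (cost per step, re-runs per bug, bugs per day, a 20-workday month) are this article's assumptions, not measurements:

```python
# worked cost comparison: full-chain re-runs vs. fork-and-replay.
STEPS = 20
COST_PER_STEP = 0.08      # average llm cost per step, in dollars (assumed)
RERUNS_PER_BUG = 12
BUGS_PER_DAY = 8
WORKDAYS_PER_MONTH = 20

full_run = STEPS * COST_PER_STEP                 # $1.60 per full chain
monthly = full_run * RERUNS_PER_BUG * BUGS_PER_DAY * WORKDAYS_PER_MONTH

# forking at step 13 re-executes only steps 13-20: 8 of 20 steps.
replayed_run = 8 * COST_PER_STEP                 # $0.64 per debug run
monthly_with_replay = replayed_run * RERUNS_PER_BUG * BUGS_PER_DAY * WORKDAYS_PER_MONTH

savings = monthly - monthly_with_replay
```

running this gives $3,072 without replay, $1,229 with it, and roughly $1,843 in monthly savings — the figures in the paragraph above.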
"the most expensive thing in agent development isn't the compute. it's the time your engineers spend re-running entire chains because they can't isolate a single step."
i think what's happening right now in the agent ecosystem mirrors what happened with web services around 2010-2012. everyone was building microservices and distributed systems, and the monitoring story was "check the logs." it took a few years of production pain before distributed tracing (zipkin, jaeger, eventually datadog apm) became standard. the insight was that request-response monitoring wasn't enough for distributed systems — you needed to trace a single request across service boundaries.
agents are undergoing the same transition, but the gap is wider. at least microservices had deterministic behavior — the same input produced the same output (usually). agents are stochastic. the same input can produce different outputs depending on model temperature, context ordering, and the phase of the moon. you need observability that accounts for nondeterminism, that can detect when behavior drifts, and that gives you tools to reason about probability distributions of outcomes rather than single execution paths.
most teams building agents today are going to learn this the hard way. they'll ship an agent that works beautifully in development, watch it fail unpredictably in production, and spend weeks building ad-hoc logging and replay infrastructure before realizing they need purpose-built tooling. some of them will build it themselves. some will use timemachine or something like it. either way, they'll end up in the same place: you cannot run agents in production without step-level observability and the ability to replay from arbitrary points.
the question isn't whether agent observability becomes a standard part of the stack. it's whether you adopt it before or after your first production incident where you spend three days reproducing a bug that would have taken ten minutes to diagnose with proper tooling.
the gap between "agent that works in demo" and "agent that works in production" is not a model quality problem. it's not a prompt engineering problem. it's an observability problem. and right now, almost nobody is treating it like one.