Why Understanding AI Internals Won't Explain Agent Failures
Model internals can explain tokens, but agent failures are timeline failures that require causal execution history.
Imagine a financial firm in Zurich running a pilot with an AI agent that prepares cross-border transfers. The agent gathers information, checks rules, and assembles transactions for human approval. Exactly the kind of workflow people expect agents to automate.
One afternoon the system prepares a transfer that later turns out to violate a sanctions rule.
The regulator's question is not "why did this payment happen?"
It is simpler.
When did the system first know this transfer would violate the rule?
The company had logs. Thousands of them. Prompts. Model outputs. Tool calls. Database writes. Tracing spans. Monitoring dashboards. Engineers could reconstruct the entire execution step by step.
They still could not answer the regulator's question.
Somewhere earlier in the process the agent misunderstood a piece of information. Hours later that misunderstanding became a payment. Investigators were left trying to reconstruct a chain of reasoning from fragments.
The logs told them what happened.
They could not prove when the system entered the failing state.
Most current interpretability research studies what happens inside a model: attention patterns, neuron activations, feature circuits. These tools can sometimes explain why a model produced a particular token. For certain classes of problems, they work well.
But they answer a different question.
No one was asking what probability the model assigned to an output.
They were asking something at once simpler and harder.
At what point in a five-hour execution did the system's understanding of the transaction become wrong?
Which decision, among dozens, made everything after it inevitable?
That question does not live inside the model's weights. It lives in the history of what the system did, but only if that history was recorded in a way that preserves causality. Most systems do not record history that way.
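What "recorded in a way that preserves causality" could mean can be made concrete with a minimal sketch. The names here (`Event`, `CausalLog`) are illustrative, not from any existing tool: each recorded step carries explicit links to the steps it depended on, so the record forms a dependency graph rather than a flat sequence of timestamps.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One step in an agent run, linked to the events it depended on."""
    kind: str        # e.g. "model_output", "tool_call", "db_write"
    payload: dict
    parents: tuple   # ids of the events this one causally depends on
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)


class CausalLog:
    """Append-only log whose entries form a DAG, not just a sequence."""

    def __init__(self):
        self.events = {}

    def record(self, kind, payload, parents=()):
        ev = Event(kind, payload, tuple(parents))
        self.events[ev.id] = ev
        return ev.id

    def ancestors(self, event_id):
        """Every event that causally contributed to `event_id` (itself included)."""
        seen, stack = set(), [event_id]
        while stack:
            eid = stack.pop()
            if eid in seen:
                continue
            seen.add(eid)
            stack.extend(self.events[eid].parents)
        return seen
```

With a flat log, temporal order is all an investigator has, and "what did this payment depend on?" is a reconstruction effort. With parent links, it is a graph traversal.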
This is the gap that matters once agents operate in real environments. Interpretability in the microscopic sense explains what a model thinks. What investigators needed in Zurich was something different: an explanation of what the system actually did over time, and the moment when its behavior became a problem.
Once a system acts for hours, investigators stop asking why a token appeared.
They ask a different question.
When did the system enter the state from which failure became inevitable?
Answering that requires something the field mostly has not built yet. Execution histories structured so causal questions have reproducible answers. Records that do not just show what happened, but make it possible to determine when the failure actually began.
What properties would a system need to make that possible?
That is the question worth asking next.