Building Agent Telemetry for LLMs

Agents are non-deterministic, meaning they are very good at breaking things. The same input can lead to a completely different sequence of events, a spike in token consumption, or even a catastrophic failure. We need to monitor agents for misalignment, both to mitigate errors in the short term and to understand how to architect our agents in the long term.

Agents quickly fan out into subagents and tools across multiple turns, and we scope our trace accordingly. This includes the user request, the agent turn, the tools invoked, the latency, the documents referenced, as well as guardrails and evaluation metrics. We must also determine which reasoning tokens to persist as we stream our trace.
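As a rough illustration of that scope, the sketch below models a trace as nested records. All class and field names (Trace, AgentTurn, ToolCall, reasoning_summary and so on) are assumptions made for this example, not an established schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class ToolCall:
    tool: str                      # name of the tool invoked
    latency_ms: float              # wall-clock latency of the call
    success: bool                  # whether the call completed without error
    documents: list = field(default_factory=list)  # documents referenced


@dataclass
class AgentTurn:
    agent: str                     # root agent or subagent name
    model: str                     # model that served this turn
    tool_calls: list = field(default_factory=list)
    guardrail_scores: dict = field(default_factory=dict)
    # Hypothetical field: only the reasoning we choose to persist.
    reasoning_summary: Optional[str] = None


@dataclass
class Trace:
    trace_id: str
    user_request: str
    turns: list = field(default_factory=list)
    eval_metrics: dict = field(default_factory=dict)

    def total_tool_latency_ms(self) -> float:
        """Sum tool latency across every turn, including subagent turns."""
        return sum(c.latency_ms for t in self.turns for c in t.tool_calls)

    def to_json(self) -> str:
        """Serialise the whole trace so it can be streamed or stored."""
        return json.dumps(asdict(self))


# Usage: one request that fans out into a subagent turn.
trace = Trace(trace_id="t-1", user_request="Draft an NDA")
trace.turns.append(AgentTurn(
    agent="drafter",
    model="model-a",
    tool_calls=[
        ToolCall("search", 120.0, True, ["template.docx"]),
        ToolCall("write_file", 30.0, True),
    ],
    guardrail_scores={"exfiltration": 0.02},
))
payload = trace.to_json()
```

Keeping the reasoning field optional reflects the choice described above: the trace can be streamed without committing to persisting every reasoning token.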

Intent

As agents are given more autonomy, they become more useful. Consequently, telemetry also becomes more important as safety infrastructure.

Guardrails must be non-deterministic and ultra-low latency. As a result, specialised classification models are often used: small language models trained to look at a specific agent action, or at the complete agent trajectory.

The purpose of this is to assess whether the path the agent took:

Aligned with the user's intent.
Complied with organisation policy.

Key risks include attempts to bypass restrictions, exfiltrate data, or perform destructive actions.

When an agent triggers guardrails by scoring above a certain threshold, we can take a variety of actions. We can feed the telemetry back into the agent, trigger an alert, or end the session. We let the user dictate the specifics of this behaviour.
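The threshold logic could be sketched as below, assuming the guardrail classifier returns a score between 0 and 1 and the user supplies a list of (threshold, action) pairs. The Action names, the default thresholds and the decide function are all hypothetical, chosen only to mirror the three responses described above.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"              # below every threshold, do nothing
    FEEDBACK = "feedback"        # feed the telemetry back into the agent
    ALERT = "alert"              # trigger an alert
    END_SESSION = "end_session"  # end the session


# Hypothetical default policy; in practice the user dictates these values.
DEFAULT_POLICY = [
    (0.5, Action.FEEDBACK),
    (0.7, Action.ALERT),
    (0.9, Action.END_SESSION),
]


def decide(score, policy=DEFAULT_POLICY):
    """Return the most severe action whose threshold the score exceeds."""
    # Check the highest threshold first so the strictest response wins.
    for threshold, action in sorted(policy, key=lambda p: p[0], reverse=True):
        if score > threshold:
            return action
    return Action.ALLOW


# Usage: a stricter, user-defined policy that ends the session early.
strict_policy = [(0.55, Action.END_SESSION)]
result = decide(0.6, strict_policy)
```

Passing the policy as data rather than hard-coding it keeps the behaviour in the user's hands, as the paragraph above suggests.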

Some of the most useful operational metrics are tool call success rate and guardrail trigger rate, both broken down by tool. Measuring tool call success quickly identifies integrations that are unreliable or frequently fail under production workloads, highlighting candidates for redesign or improved error handling. Guardrail trigger rate reveals which tools frequently encroach on policy boundaries and could become a risk in the future.
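A minimal sketch of both metrics, assuming each tool call is logged as a dictionary with tool, success and guardrail_triggered keys. Those key names and the per_tool_rates function are assumptions made for this example.

```python
from collections import defaultdict


def per_tool_rates(events):
    """Compute tool call success rate and guardrail trigger rate per tool."""
    totals = defaultdict(lambda: {"calls": 0, "ok": 0, "guardrail": 0})
    for event in events:
        counts = totals[event["tool"]]
        counts["calls"] += 1
        counts["ok"] += 1 if event["success"] else 0
        counts["guardrail"] += 1 if event["guardrail_triggered"] else 0
    return {
        tool: {
            "success_rate": c["ok"] / c["calls"],
            "guardrail_trigger_rate": c["guardrail"] / c["calls"],
        }
        for tool, c in totals.items()
    }


# Usage with a handful of hypothetical tool call events.
events = [
    {"tool": "search", "success": True, "guardrail_triggered": False},
    {"tool": "search", "success": True, "guardrail_triggered": False},
    {"tool": "search", "success": True, "guardrail_triggered": True},
    {"tool": "search", "success": False, "guardrail_triggered": False},
    {"tool": "send_email", "success": True, "guardrail_triggered": True},
    {"tool": "send_email", "success": False, "guardrail_triggered": True},
]
rates = per_tool_rates(events)
```

In this made-up sample, send_email would stand out on both axes: it fails half the time and trips a guardrail on every call.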

Evaluation

Any production deployment will continually switch between different models, due to unavoidable factors including deprecation, cost and downtime. As a result, agent telemetry needs to be interoperable. Our telemetry allows us to evaluate how the harness and model work together, and maintaining a consistent trace allows us to evaluate cross-model performance to improve our harness.
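To illustrate why a consistent trace matters, the sketch below groups trace summaries by model and compares them on the same fields. The record keys (model, turns, completed), the model names and the compare_models function are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def compare_models(traces):
    """Aggregate identical trace fields per model for side-by-side review."""
    by_model = defaultdict(list)
    for trace in traces:
        by_model[trace["model"]].append(trace)
    return {
        model: {
            "traces": len(group),
            "mean_turns": mean(t["turns"] for t in group),
            "completion_rate": mean(1.0 if t["completed"] else 0.0 for t in group),
        }
        for model, group in by_model.items()
    }


# Usage: the same harness run against two placeholder models.
traces = [
    {"model": "model-a", "turns": 4, "completed": True},
    {"model": "model-a", "turns": 6, "completed": False},
    {"model": "model-b", "turns": 3, "completed": True},
]
summary = compare_models(traces)
```

Because every trace carries the same fields regardless of which model produced it, the comparison needs no per-model adapters.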

Each task is evaluated individually. For example, in legal document drafting, we look at completing a document in the minimum number of turns, but also other metrics such as how well the document performs at review. Automated evaluation works at scale, but ultimately, it helps for the user to read through an actual transcript, to get a sense for how the agent is performing.

Telemetry provides continuous alignment, through evidence gathered from real-world behaviour, rather than relying on pre-deployment evaluation. As a result, organisations can refine system prompts and guardrails using feedback across different models and deployments.

Agent telemetry short-term and long-term architecture