Autonomous agents are only as trustworthy as the window you have into what they're doing. At Agnotiq, telemetry is not an afterthought — it's the first thing we wire before handing an agent any real work.
This post walks through the shape of our trace schema, the five dashboard metrics we watch daily, and — equally important — the alerts we deliberately skip so your team doesn't drown in noise.
The shape of our trace schema
Every agent run produces an OpenTelemetry-compatible JSON trace. The document captures the full journey from first prompt to final status, so any operator — engineer or business owner — can replay exactly what happened and why.
trace_id / span_id ("a3f7c8d1-…"): Links every action in a single agent run into one queryable timeline.
attributes ({ model, prompt, max_iter }): Input prompt, model name (e.g. claude-sonnet-4-6), and iteration cap.
events[ ] (LLM step · tool call · approval): Every thinking step, tool invocation, and human-in-the-loop gate.
status ("ok" | "error"): Terminal status, the single field that drives your success-rate metric.
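As a rough illustration, a trace document with the fields listed above might be assembled and serialized like this in Python. Only the field names trace_id, span_id, attributes, events, and status come from this post; the class names (Event, AgentTrace), the kind and detail keys, and all sample values are assumptions made for the sketch, not Agnotiq's actual schema.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class Event:
    # "kind" and "detail" are hypothetical keys chosen for this sketch.
    kind: str  # e.g. "llm_step", "tool_call", "approval"
    detail: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentTrace:
    trace_id: str
    span_id: str
    attributes: dict[str, Any]
    events: list[Event] = field(default_factory=list)
    status: str = "ok"  # terminal status: "ok" or "error"

    def to_json(self) -> str:
        # asdict() recurses into the nested Event dataclasses.
        return json.dumps(asdict(self), indent=2)


# Sample values below are invented for illustration.
trace = AgentTrace(
    trace_id=str(uuid.uuid4()),
    span_id=uuid.uuid4().hex[:16],
    attributes={
        "model": "claude-sonnet-4-6",
        "prompt": "Summarize the open tickets",
        "max_iter": 8,
    },
)
trace.events.append(Event("llm_step", {"note": "plan the lookup"}))
trace.events.append(Event("tool_call", {"tool": "ticket_search"}))
trace.events.append(Event("approval", {"approved": True}))

print(trace.to_json())
```

Keeping status as a single top-level field means a success-rate query only has to read one value per run rather than scan the event list.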
The key insight in our schema design: events are append-only. We never overwrite a step. That means a post-mortem can walk the entire decision tree as it actually unfolded, not a reconstructed approximation.
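The append-only rule can be sketched as a small Python class that exposes appending and reading but no way to edit or delete a step. The names AppendOnlyEvents and Step, the seq counter, and the sample events are hypothetical; this shows one possible way to enforce the rule in application code, not how Agnotiq stores traces.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass(frozen=True)
class Step:
    # frozen=True blocks reassigning fields on a recorded step.
    # (The detail dict itself is still mutable; a stricter version
    # could store an immutable mapping instead.)
    seq: int
    kind: str
    detail: dict[str, Any]


class AppendOnlyEvents:
    """Event log that supports appending and reading, nothing else."""

    def __init__(self) -> None:
        self._steps: list[Step] = []

    def append(self, kind: str, detail: dict[str, Any]) -> Step:
        # Copy the caller's dict so later changes to it don't alter the log.
        step = Step(seq=len(self._steps), kind=kind, detail=dict(detail))
        self._steps.append(step)
        return step

    def __iter__(self) -> Iterator[Step]:
        # Iterate over a snapshot so readers never touch the internal list.
        return iter(tuple(self._steps))

    def __len__(self) -> int:
        return len(self._steps)


log = AppendOnlyEvents()
log.append("llm_step", {"thought": "look up the invoice"})
log.append("tool_call", {"tool": "invoice_lookup", "result": "timeout"})
# A retry is recorded as a new step; the failed attempt stays in the log.
log.append("tool_call", {"tool": "invoice_lookup", "result": "ok"})

# Replaying the run in order, as a post-mortem would.
for step in log:
    print(step.seq, step.kind, step.detail)
```

Because the retry is a new entry rather than an overwrite, the timeout that preceded the successful call remains visible during a replay.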
Dashboards that drive decisions
We track five core metrics on always-on dashboards. Everything else is derived from these — there is no “maybe useful later” column on our boards.
Each of the five metrics is tied to a business outcome: faster agents and happier customers, fewer manual fixes, controlled AI compute costs, predictable monthly spend, and reliable automation ROI.
A 98.5% success rate looks great on its face. What makes it trustworthy is the 1.5% that failed: we know exactly which tool call timed out, which approval sat in the queue too long, and which prompt sent the agent off the rails. That failure visibility is how you defend a 98.5% figure to a customer without crossing your fingers.
Every failure was traced to a specific tool timeout or approval delay — no invisible errors.
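To make the idea concrete, here is a minimal Python sketch that computes a success rate from terminal statuses and groups the failures by cause. The failure_cause field, the cause labels, and the 200-run sample are all invented for illustration; the sample is sized so the result matches the 98.5% figure quoted above and is not real Agnotiq data.

```python
from collections import Counter

# Synthetic sample: 200 runs, 3 of them failed. Not real data.
runs = (
    [{"status": "ok", "failure_cause": None}] * 197
    + [{"status": "error", "failure_cause": "tool_timeout"}] * 2
    + [{"status": "error", "failure_cause": "approval_delay"}]
)


def success_rate(runs):
    """Fraction of runs whose terminal status is "ok"."""
    ok = sum(1 for r in runs if r["status"] == "ok")
    return ok / len(runs)


def failure_breakdown(runs):
    """Count failed runs per cause; a missing cause is counted as "unattributed"."""
    return Counter(
        r["failure_cause"] or "unattributed"
        for r in runs
        if r["status"] == "error"
    )


rate = success_rate(runs)
causes = failure_breakdown(runs)

print(f"success rate: {rate:.1%}")
print(dict(causes))
```

An "unattributed" bucket that stays empty is the checkable version of "no invisible errors": every failed run carries a named cause.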
Alerts we don't bother with
Alert fatigue is a real cost. When everything pages, nothing gets fixed — the team learns to ignore the channel. We made an explicit decision early on: no alert fires unless a human can take a meaningful action within the hour.
Only three alerts fire: cost spikes, approval backlogs, and compliance gaps. These three hit your P&L directly. Everything else gets logged and reviewed in the weekly retro, not in Slack at 2 am.
All three share one property: a human decision in the next hour changes the outcome. Minor LLM quirks and cloud latency bumps don't meet that bar, so they don't get a page.
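The paging rule described in this section can be sketched as a small routing function in Python. The category labels, the Signal class, the actionable_within_hour flag, and the sample signals are assumptions made for this example; the post names the three alert types and the one-hour test, but not how they are encoded.

```python
from dataclasses import dataclass

# The three alert types named in the post; the string labels are assumptions.
PAGING_CATEGORIES = {"cost_spike", "approval_backlog", "compliance_gap"}


@dataclass
class Signal:
    category: str
    actionable_within_hour: bool  # can a human change the outcome within the hour?
    message: str


def route(signal: Signal) -> str:
    """Return "page" only for a paging category a human can act on now, else "log"."""
    if signal.category in PAGING_CATEGORIES and signal.actionable_within_hour:
        return "page"
    return "log"  # kept for the weekly review, no notification


# Hypothetical signals for illustration.
signals = [
    Signal("cost_spike", True, "token spend is far above the daily baseline"),
    Signal("approval_backlog", True, "approvals are waiting in the queue"),
    Signal("cloud_latency", False, "brief latency bump from the provider"),
    Signal("llm_quirk", False, "model returned a slightly odd format"),
]

for s in signals:
    print(f"{route(s):4} {s.category}: {s.message}")
```

Requiring both conditions keeps the rule conservative: a signal outside the three categories is logged even if someone could act on it, and a signal inside them is logged when no action within the hour would help.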
What this gives your business
The trace schema and dashboards are not engineering vanity projects. They are the artifacts your operations team reads when something feels off, and the evidence your finance team reaches for when they want to justify the line item. Properly structured telemetry typically lets Agnotiq customers cut AI compute costs by 30–50% within the first quarter, not because the agents change, but because the visibility exposes waste that was previously hidden.
Bottom line
Telemetry is not tech debt you pay off later. It is the control panel you need from day one. With the right schema, five focused metrics, and an alert policy built around business impact rather than engineering noise, your agents stop being a black box and start being something you can defend, optimize, and scale with confidence.
If you'd like to walk through how we'd instrument your specific workflow, drop us a line at hello@agnotiq.com. The logs already know the answer.