Agent observability is qualitatively different from service observability. You're tracking not just 'is it up?' but 'did it reason well?' Per-turn traces, cost, and quality signals together give the full picture; any one of them alone misleads.

Trace structure

Each conversation is a trace with one span per turn. Each turn has child spans for the model call, each tool call, and response generation. Span attributes record tokens in and out, model, tools used, latency, and cost. Keep the format OpenTelemetry-compatible so it integrates with existing infrastructure.
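A minimal sketch of that hierarchy, using plain dataclasses so it runs without dependencies. The `Span` class, the attribute keys (`tokens.input`, `tool.name`, and so on), and the example values are all assumptions made for illustration, not a standard schema; in an OpenTelemetry setup each node would be a real span nested under its parent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span: a name, attributes, children."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> "Span":
        # Create a child span and attach it under this one.
        span = Span(name, dict(attributes or {}))
        self.children.append(span)
        return span

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Span"]]:
        # Depth-first traversal yielding (depth, span) pairs.
        yield depth, self
        for c in self.children:
            yield from c.walk(depth + 1)


def build_example_trace() -> Span:
    # One conversation with a single turn; all values are made up.
    conversation = Span("conversation", {"conversation.id": "conv-123"})
    turn = conversation.child("turn", {"turn.index": 0})
    turn.child("model_call", {
        "model": "example-model",
        "tokens.input": 1200,
        "tokens.output": 150,
        "latency_ms": 840,
    })
    turn.child("tool_call", {
        "tool.name": "search_docs",
        "latency_ms": 310,
        "cost_usd": 0.002,
    })
    turn.child("response_generation", {"tokens.output": 220, "latency_ms": 600})
    return conversation


if __name__ == "__main__":
    for depth, span in build_example_trace().walk():
        print("  " * depth + span.name, span.attributes)
```

Keeping the turn as its own span means per-turn latency and cost can be read off one node instead of being reassembled from its children.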

Per-turn cost

Per-turn cost is the sum of input tokens × input price, output tokens × output price, and tool call costs (API calls cost money too). Aggregate across users to find expensive patterns. A common finding: 1% of conversations account for 30% of cost.
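The arithmetic can be sketched as follows. `Pricing`, `turn_cost`, and `top_share` are hypothetical names, and the prices in the usage example are placeholders rather than any provider's real rates.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Pricing:
    # Prices in USD per one million tokens (placeholder units, not real rates).
    input_per_million: float
    output_per_million: float


def turn_cost(input_tokens: int, output_tokens: int, pricing: Pricing,
              tool_costs: Iterable[float] = ()) -> float:
    """Cost of one turn: model tokens plus any paid tool calls."""
    model_cost = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    return model_cost + sum(tool_costs)


def top_share(conversation_costs: Sequence[float], fraction: float = 0.01) -> float:
    """Share of total cost spent by the most expensive `fraction` of conversations."""
    ranked = sorted(conversation_costs, reverse=True)
    total = sum(ranked)
    if not ranked or total == 0:
        return 0.0
    top_n = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:top_n]) / total


if __name__ == "__main__":
    pricing = Pricing(input_per_million=3.0, output_per_million=15.0)
    print(turn_cost(1000, 200, pricing, tool_costs=[0.002]))
    print(top_share([10.0, 1.0, 1.0, 1.0], fraction=0.25))
```

Summing `turn_cost` per conversation and passing the totals to `top_share` is one way to check how concentrated spend is in your own traffic.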

Quality signals

No single signal is enough. User feedback (thumbs up/down), LLM-as-judge scores on sampled conversations, task completion (did the user reach their goal?), and tool-call accuracy (was the right tool called with the right arguments?) each give a partial view; combine them.
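One possible way to combine them, sketched under assumptions: every signal is mapped to a 0..1 value, missing signals are skipped rather than treated as zero, and the rest are averaged with weights. The `QualitySignals` record, the equal default weights, and the weighted-average rule are illustrative choices only; the text does not prescribe a particular formula.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical equal weights; tune these for your own product.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "user_feedback": 1.0,
    "judge_score": 1.0,
    "task_completed": 1.0,
    "tool_call_accuracy": 1.0,
}


@dataclass
class QualitySignals:
    user_feedback: Optional[bool] = None         # thumbs up (True) / down (False)
    judge_score: Optional[float] = None          # LLM-as-judge score in 0..1
    task_completed: Optional[bool] = None        # did the user reach their goal?
    tool_call_accuracy: Optional[float] = None   # fraction of correct tool calls


def combined_score(signals: QualitySignals,
                   weights: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Weighted average over the signals that are present; None if there are none."""
    weights = weights or DEFAULT_WEIGHTS
    raw = {
        "user_feedback": signals.user_feedback,
        "judge_score": signals.judge_score,
        "task_completed": signals.task_completed,
        "tool_call_accuracy": signals.tool_call_accuracy,
    }
    # Booleans become 1.0 / 0.0; absent signals are dropped, not counted as 0.
    present = {k: float(v) for k, v in raw.items() if v is not None}
    if not present:
        return None
    total_weight = sum(weights[k] for k in present)
    return sum(weights[k] * v for k, v in present.items()) / total_weight


if __name__ == "__main__":
    print(combined_score(QualitySignals(user_feedback=True, judge_score=0.5)))
    print(combined_score(QualitySignals()))
```

Returning `None` when nothing was observed keeps unscored conversations out of averages instead of dragging them toward zero.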

Sampling strategy

Capturing 100% of traces is expensive at scale. Instead, keep all errors, all user-flagged conversations, and all high-cost conversations, plus 1% of normal traffic. Automatically tag traces that are worth adding to the eval set and surface them for human review.
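The keep/drop rule can be sketched as a single function. The name `should_keep`, its parameters, and the 1.00 USD high-cost threshold are assumptions for illustration. Because errors and cost are only known once a conversation has finished, the decision has to be made after the trace is complete. Hashing the trace ID, rather than drawing a random number, is a design choice that makes the 1% baseline deterministic, so the same trace always gets the same answer.

```python
import hashlib


def should_keep(trace_id: str, *, has_error: bool, user_flagged: bool,
                cost_usd: float, high_cost_threshold_usd: float = 1.0,
                baseline_rate: float = 0.01) -> bool:
    """Decide whether to retain a finished trace."""
    # Always keep the interesting cases: errors, user flags, high cost.
    if has_error or user_flagged or cost_usd >= high_cost_threshold_usd:
        return True
    # Otherwise keep a deterministic baseline sample: map the trace ID to a
    # 64-bit bucket and keep it if it falls in the lowest `baseline_rate` slice.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < baseline_rate * 2**64


if __name__ == "__main__":
    kept = sum(
        should_keep(f"trace-{i}", has_error=False, user_flagged=False, cost_usd=0.01)
        for i in range(100_000)
    )
    print(f"kept {kept} of 100000 normal traces")
```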

Tools

Langfuse, Arize Phoenix, OpenLLMetry, LangSmith, Helicone. All do roughly the same thing at the metrics layer; they differ mainly in how they integrate with eval workflows. Pick by what your team will actually use.

In short: hierarchical traces, per-turn cost, multi-signal quality, and sampling to control cost. Pick the tool your team will actually use, not the one with the best benchmark.