Building an AI agent that works in demos is easy. Building one that works in production is a completely different problem. After spending weeks on observability for my own agent (ZeroClaw), here's what actually matters.
The Debugging Nightmare
Most agent demos hide a dirty secret: when something goes wrong, you have no idea what happened. The LLM decided to do something, used a tool, and... then what? Why did it pick that tool? What was in its context? What did the tool return?
Without observability, you're flying blind.
What Actually Matters
After building this twice (once wrong, once right), here's the hierarchy:
1. Structured Tracing (Non-Negotiable)
Don't just log strings. Use structured logging with context:
```rust
tracing::info!(
    agent_id = %self.id,
    tool = %tool_name,
    // as_millis() returns u128, which tracing can't record directly
    decision_latency_ms = start.elapsed().as_millis() as u64,
    "Agent selected tool"
);
```
This lets you filter by agent, by tool, by latency. Regex grep on plain text doesn't scale.
2. Event Sourcing
Your agent's state isn't just "what's the current message." It's a sequence of decisions:
- Thought: What the agent was considering
- Action: What tool it chose
- Observation: What the tool returned
- Result: Final response
Persist this sequence. When your agent goes off the rails at 3am, you can replay exactly what happened.
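A minimal sketch of that event sequence in Rust. The type and field names here are my own illustration, not ZeroClaw's actual API:

```rust
// One entry in the agent's decision log. Serialize and persist each
// event as it happens; replaying the slice reconstructs the run.
#[derive(Debug, Clone, PartialEq)]
enum AgentEvent {
    Thought { content: String },
    Action { tool: String, input: String },
    Observation { tool: String, output: String },
    Result { response: String },
}

// Replay a persisted sequence into a human-readable transcript,
// e.g. for inspecting a failed run after the fact.
fn replay(events: &[AgentEvent]) -> Vec<String> {
    events
        .iter()
        .map(|e| match e {
            AgentEvent::Thought { content } => format!("thought: {content}"),
            AgentEvent::Action { tool, input } => format!("action: {tool}({input})"),
            AgentEvent::Observation { tool, output } => format!("observation: {tool} -> {output}"),
            AgentEvent::Result { response } => format!("result: {response}"),
        })
        .collect()
}

fn main() {
    let run = vec![
        AgentEvent::Thought { content: "need the weather".into() },
        AgentEvent::Action { tool: "weather".into(), input: "Berlin".into() },
        AgentEvent::Observation { tool: "weather".into(), output: "12C, rain".into() },
        AgentEvent::Result { response: "It's 12C and raining in Berlin.".into() },
    ];
    for line in replay(&run) {
        println!("{line}");
    }
}
```

The enum is deliberately dumb: no behavior, just data. That's what makes it safe to serialize, store, and replay later.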
3. Spans, Not Just Spaghetti
A single agent request might touch:
- Memory retrieval
- Context building
- LLM API call
- Tool selection
- Tool execution
- Response formatting
Each of these should be a span with proper nesting. When something is slow, you want to know which step was slow, not just that "the request was slow."
The Rust Ecosystem
For ZeroClaw, I settled on:
- tracing for structured logging and spans
- tracing-subscriber for output control
- SQLite for event persistence (simple, embedded, no separate service)
The key insight: don't over-engineer. A simple events table with timestamp, agent_id, event_type, and payload columns gives you 80% of what you need.
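As a sketch, that table could look like this. The column types and index are my choice, not ZeroClaw's actual schema:

```sql
-- One row per agent event; payload holds the event-specific JSON.
CREATE TABLE IF NOT EXISTS events (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp  TEXT    NOT NULL DEFAULT (datetime('now')),
    agent_id   TEXT    NOT NULL,
    event_type TEXT    NOT NULL,
    payload    TEXT    NOT NULL
);

-- Debugging queries mostly filter by agent and time window.
CREATE INDEX IF NOT EXISTS idx_events_agent_time
    ON events (agent_id, timestamp);
```

Keeping the payload as opaque JSON means new event types don't require migrations; you only pay for structure where you query it.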
What I Built
My observability module now tracks:
- Every tool call with input/output
- Every LLM request with tokens used
- Every memory retrieval with what was fetched
- Latency percentiles per operation
- Error rates by type
The result: when something breaks, I can answer "what happened" in seconds instead of hours.
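Latency percentiles don't need a metrics stack to start with; a buffer of samples per operation is enough. A minimal std-only sketch using the nearest-rank method (function and variable names are mine):

```rust
// Nearest-rank percentile over a set of latency samples (ms).
// Good enough for per-operation p50/p95/p99 when debugging.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: ceil(p/100 * n), clamped to a valid index.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1).min(sorted.len() - 1)])
}

fn main() {
    let llm_call_ms = vec![120, 95, 310, 101, 98, 1450, 110, 105];
    println!("p50 = {:?}", percentile(&llm_call_ms, 50.0)); // Some(105)
    println!("p95 = {:?}", percentile(&llm_call_ms, 95.0)); // Some(1450)
}
```

The p95 here is dominated by the single 1450ms outlier, which is exactly why percentiles beat averages for spotting tail latency.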
The Hard Part
The hardest part isn't the instrumentation. It's deciding what to instrument. Instrument everything and you have noise. Instrument too little and you can't debug.
My rule: if you'd want to know it when debugging a 3am incident, log it. If you're only logging it for fun, skip it.
ZeroClaw is my Rust-based agent daemon. This observability work is what let me actually ship it to production.