Building an AI agent that works in demos is easy. Building one that works in production is a completely different problem. After spending weeks on observability for my own agent (ZeroClaw), here's what actually matters.

The Debugging Nightmare

Most agent demos hide a dirty secret: when something goes wrong, you have no idea what happened. The LLM decided to do something, used a tool, and... then what? Why did it pick that tool? What was in its context? What did the tool return?

Without observability, you're flying blind.

What Actually Matters

After building this twice (once wrong, once right), here's the hierarchy:

1. Structured Tracing (Non-Negotiable)

Don't just log strings. Use structured logging with context:

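// Emitted inside the agent's tool-selection method; `start` is assumed to be
// a std::time::Instant captured before the decision began.
tracing::info!(
    agent_id = %self.id,
    tool = %tool_name,
    // as_millis() returns u128; u64 is the more widely supported field type
    decision_latency_ms = start.elapsed().as_millis() as u64,
    "Agent selected tool"
);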

This lets you filter by agent, by tool, by latency. Grepping plain-text logs with regexes doesn't scale.

2. Event Sourcing

Your agent's state isn't just "what's the current message." It's a sequence of decisions:

Persist this sequence. When your agent goes off the rails at 3am, you can replay exactly what happened.
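
Here is a minimal sketch of the idea, using only the standard library. The event variants, the EventLog and AgentState types, and the replay function are all hypothetical names chosen for illustration; they are not ZeroClaw's actual types. The point is only that state is derived by folding over an append-only log, so any prefix of the log reproduces the state at that moment.

```rust
/// Hypothetical event types, chosen for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum AgentEvent {
    UserMessage { text: String },
    ToolSelected { tool: String },
    ToolReturned { tool: String, output: String },
    Replied { text: String },
}

/// Append-only log: events are pushed, never edited or removed.
#[derive(Debug, Default)]
pub struct EventLog {
    events: Vec<AgentEvent>,
}

impl EventLog {
    pub fn append(&mut self, event: AgentEvent) {
        self.events.push(event);
    }

    pub fn events(&self) -> &[AgentEvent] {
        &self.events
    }
}

/// State derived purely from the log, so it can be rebuilt at any time.
#[derive(Debug, Default, PartialEq)]
pub struct AgentState {
    pub tool_calls: usize,
    pub pending_tool: Option<String>,
    pub last_reply: Option<String>,
}

/// Replay folds the events, in order, into a fresh state.
pub fn replay(events: &[AgentEvent]) -> AgentState {
    let mut state = AgentState::default();
    for event in events {
        match event {
            AgentEvent::UserMessage { .. } => {}
            AgentEvent::ToolSelected { tool } => {
                state.tool_calls += 1;
                state.pending_tool = Some(tool.clone());
            }
            AgentEvent::ToolReturned { .. } => state.pending_tool = None,
            AgentEvent::Replied { text } => state.last_reply = Some(text.clone()),
        }
    }
    state
}
```

Replaying a slice such as the first two events gives the state the agent was in right after it picked a tool, which is usually the moment you care about during an incident. In a real system the Vec would be backed by durable storage.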

3. Spans, Not Just Spaghetti

A single agent request might touch:

Each of these should be a span with proper nesting. When something is slow, you want to know what was slow, not just that "the request was slow."
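
To make the value of nesting concrete, here is a standard-library-only sketch of the data a span tree captures. This is not the tracing crate's API; the Span struct, its methods, and the span names in the usage note are hypothetical. Durations are passed in directly to keep the example deterministic, where real code would measure them.

```rust
use std::time::Duration;

/// One timed unit of work, with child spans nested inside it.
/// Hypothetical type for illustration; not the `tracing` crate's Span.
#[derive(Debug)]
pub struct Span {
    pub name: String,
    pub duration: Duration,
    pub children: Vec<Span>,
}

impl Span {
    pub fn new(name: &str, duration_ms: u64) -> Self {
        Span {
            name: name.to_string(),
            duration: Duration::from_millis(duration_ms),
            children: Vec::new(),
        }
    }

    /// Builder-style helper that nests `child` under this span.
    pub fn with_child(mut self, child: Span) -> Self {
        self.children.push(child);
        self
    }

    /// Time spent in this span itself, excluding time spent in its children.
    pub fn self_time(&self) -> Duration {
        let in_children: Duration = self.children.iter().map(|c| c.duration).sum();
        self.duration.saturating_sub(in_children)
    }

    /// Depth-first search for the span with the largest self time.
    pub fn slowest(&self) -> &Span {
        let mut worst = self;
        for child in &self.children {
            let candidate = child.slowest();
            if candidate.self_time() > worst.self_time() {
                worst = candidate;
            }
        }
        worst
    }
}
```

Given a 1300 ms request span containing a 400 ms llm_call and an 850 ms tool_call that itself wraps an 800 ms http_fetch, slowest() returns http_fetch. A flat log line would only have told you the request took 1.3 seconds.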

The Rust Ecosystem

For ZeroClaw, I settled on:

The key insight: don't over-engineer. A simple events table with four columns (timestamp, agent_id, event_type, payload) covers about 80% of what you need.
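
As a rough sketch of how little that table needs, here is an in-memory stand-in with the four columns named above. The EventRow and EventsTable types, the timeline query, the millisecond timestamp unit, and the JSON-string payload are all assumptions made for illustration; a real deployment would keep the same shape in a database.

```rust
/// One row of the events table, with the four columns named in the text.
#[derive(Debug, Clone, PartialEq)]
pub struct EventRow {
    pub timestamp_ms: u64, // assumed unit: milliseconds since the Unix epoch
    pub agent_id: String,
    pub event_type: String,
    pub payload: String, // assumed to be a serialized JSON document
}

/// In-memory stand-in for the table (hypothetical; not a real storage layer).
#[derive(Debug, Default)]
pub struct EventsTable {
    rows: Vec<EventRow>,
}

impl EventsTable {
    pub fn insert(&mut self, timestamp_ms: u64, agent_id: &str, event_type: &str, payload: &str) {
        self.rows.push(EventRow {
            timestamp_ms,
            agent_id: agent_id.to_string(),
            event_type: event_type.to_string(),
            payload: payload.to_string(),
        });
    }

    /// All events for one agent within [from_ms, to_ms], oldest first.
    pub fn timeline(&self, agent_id: &str, from_ms: u64, to_ms: u64) -> Vec<&EventRow> {
        let mut hits: Vec<&EventRow> = self
            .rows
            .iter()
            .filter(|r| {
                r.agent_id == agent_id && r.timestamp_ms >= from_ms && r.timestamp_ms <= to_ms
            })
            .collect();
        // Rows may arrive out of order, so sort by timestamp before returning.
        hits.sort_by_key(|r| r.timestamp_ms);
        hits
    }
}
```

The one query that matters most during an incident is "show me everything this agent did in this time window, in order," and this layout answers it with a filter and a sort.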

What I Built

My observability module now tracks:

The result: when something breaks, I can answer "what happened" in seconds instead of hours.

The Hard Part

The hardest part isn't the instrumentation. It's deciding what to instrument. Instrument everything and you have noise. Instrument too little and you can't debug.

My rule: if you'd want to know it when debugging a 3am incident, log it. If you're only logging it for fun, skip it.


ZeroClaw is my Rust-based agent daemon. This observability work is what let me actually ship it to production.