Building Agents That Remember: Persistence Strategies for Long-Running AI Systems

Your agent was three hours into analyzing a complex codebase. It had mapped 47 files, found 12 potential bugs, and was about to write its report. Then the process crashed.

Three hours of work. Gone.

This is the persistence problem, and most agent developers only start thinking about it after it hits them. The fix isn't just "save more often"; it's designing your agent's memory system to survive the inevitable failures.

The Three Layers of Agent Persistence

Most agents have a simple memory model: context window = working memory, disk = nothing. But real agents need three distinct layers:

1. Working Memory — What's in the LLM context right now. Fast, expensive, volatile.

2. Session Memory — What happened in this session. Persisted to disk, survives process restart.

3. Long-term Memory — What the agent has learned across sessions. The "knowledge" that accumulates.

ZeroClaw uses SQLite for both session and long-term memory. The schema is simple but effective:

CREATE TABLE memories (
    id INTEGER PRIMARY KEY,
    content TEXT NOT NULL,
    memory_type TEXT NOT NULL,  -- 'working', 'session', 'longterm'
    importance INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
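
On the Rust side, the memory_type column could map onto a small enum so the layer names only live in one place. This is a rough sketch, not ZeroClaw's actual code: MemoryType, Memory, parse, and survives_restart are names I'm making up for illustration, and the timestamp columns are left out.

```rust
/// The three layers, mirroring the values allowed in `memory_type`.
/// (Hypothetical type, not taken from ZeroClaw's source.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryType {
    Working,
    Session,
    LongTerm,
}

impl MemoryType {
    /// String stored in the `memory_type` column.
    fn as_str(self) -> &'static str {
        match self {
            MemoryType::Working => "working",
            MemoryType::Session => "session",
            MemoryType::LongTerm => "longterm",
        }
    }

    /// Parse a column value; unknown strings are rejected rather than guessed.
    fn parse(s: &str) -> Option<Self> {
        match s {
            "working" => Some(MemoryType::Working),
            "session" => Some(MemoryType::Session),
            "longterm" => Some(MemoryType::LongTerm),
            _ => None,
        }
    }

    /// Whether this layer is expected to outlive the process.
    /// Assumption: working memory is the volatile one.
    fn survives_restart(self) -> bool {
        !matches!(self, MemoryType::Working)
    }
}

/// One row of the `memories` table (timestamps omitted for brevity).
#[derive(Debug, Clone)]
struct Memory {
    id: i64,
    content: String,
    memory_type: MemoryType,
    importance: i64,
}
```

Rejecting unknown strings in parse means a typo in the database shows up as an error instead of silently landing in the wrong layer.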

But here's the catch: just storing memories isn't enough. You need to track what the agent was doing when it crashed.

The Event Sourcing Approach

Instead of just storing state, store events: what action was taken, with what inputs, and what came out. This gives you two superpowers:

Replay — After a crash, reconstruct the agent's state by replaying events
Audit — Understand exactly what the agent did, why, and what happened

For ZeroClaw, I'm adding an events table:

CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    event_type TEXT NOT NULL,  -- 'thought', 'action', 'result', 'error'
    content TEXT NOT NULL,
    tool_name TEXT,
    tool_input TEXT,
    tool_output TEXT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This is inspired by event sourcing patterns from distributed systems. Each step the agent takes becomes an immutable record: rows are appended, never updated or deleted.
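
Replay can then be as simple as folding the event log back into a state struct. Here's a minimal sketch, assuming the rows have already been loaded from the events table into memory. Event, AgentState, apply, and replay are hypothetical names, and what each event type does to the state is my guess at a reasonable rule, not something ZeroClaw defines.

```rust
/// Mirrors the `event_type` column of the events table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventType {
    Thought,
    Action,
    Result,
    Error,
}

/// A trimmed-down row of the events table (tool columns omitted).
#[derive(Debug, Clone)]
struct Event {
    id: i64,
    session_id: String,
    event_type: EventType,
    content: String,
}

impl Event {
    fn new(id: i64, session_id: &str, event_type: EventType, content: &str) -> Self {
        Event {
            id,
            session_id: session_id.to_string(),
            event_type,
            content: content.to_string(),
        }
    }
}

/// State rebuilt from the log. The fields are illustrative assumptions.
#[derive(Debug, Default, PartialEq)]
struct AgentState {
    steps_completed: usize,
    findings: Vec<String>,
    /// An action that was started but never produced a result or error,
    /// i.e. what the agent was doing when it crashed.
    pending_action: Option<String>,
    errors: usize,
}

/// Apply a single event to the state.
fn apply(state: &mut AgentState, event: &Event) {
    match event.event_type {
        EventType::Thought => {}
        EventType::Action => state.pending_action = Some(event.content.clone()),
        EventType::Result => {
            state.pending_action = None;
            state.steps_completed += 1;
            state.findings.push(event.content.clone());
        }
        EventType::Error => {
            state.pending_action = None;
            state.errors += 1;
        }
    }
}

/// Rebuild the state of one session by replaying its events in id order.
fn replay(events: &[Event], session_id: &str) -> AgentState {
    let mut ordered: Vec<&Event> = events
        .iter()
        .filter(|e| e.session_id == session_id)
        .collect();
    ordered.sort_by_key(|e| e.id);

    let mut state = AgentState::default();
    for event in ordered {
        apply(&mut state, event);
    }
    state
}
```

The pending_action field is the part that answers "what was the agent doing when it crashed": if the log ends with an action and no matching result, replay leaves it set and the agent can decide whether to retry.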

The Observer Pattern for Persistence

The cleanest way to implement this isn't to sprinkle save calls throughout your agent code. It's to use the observer pattern:

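trait Observer: Send + Sync {
    /// Called by the agent core for every event it emits.
    fn on_event(&self, event: &AgentEvent);
}

struct PersistenceObserver {
    db: Sqlite,
}

impl Observer for PersistenceObserver {
    fn on_event(&self, event: &AgentEvent) {
        // A failed write shouldn't take the whole agent down with it,
        // so log the error instead of calling unwrap().
        if let Err(err) = self.db.insert_event(event) {
            eprintln!("failed to persist event: {:?}", err);
        }
    }
}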

Your agent core just emits events. The observer handles persistence. Separation of concerns, testable, clean.
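
To make the "testable" part concrete, here is a sketch of how the wiring might look, with an in-memory observer standing in for the SQLite one. Agent, subscribe, emit, run_step, RecordingObserver, and the fields on AgentEvent are all assumptions of mine; only the Observer trait shape comes from the snippet above.

```rust
use std::sync::{Arc, Mutex};

/// Minimal stand-in for the real event type (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
struct AgentEvent {
    event_type: String,
    content: String,
}

trait Observer: Send + Sync {
    fn on_event(&self, event: &AgentEvent);
}

/// Test double: keeps events in memory instead of writing to SQLite.
#[derive(Default)]
struct RecordingObserver {
    seen: Mutex<Vec<AgentEvent>>,
}

impl Observer for RecordingObserver {
    fn on_event(&self, event: &AgentEvent) {
        // lock() only fails if another thread panicked while holding the lock.
        self.seen.lock().unwrap().push(event.clone());
    }
}

/// Hypothetical agent core: it knows about observers, not about storage.
struct Agent {
    observers: Vec<Arc<dyn Observer>>,
}

impl Agent {
    fn new() -> Self {
        Agent { observers: Vec::new() }
    }

    fn subscribe(&mut self, observer: Arc<dyn Observer>) {
        self.observers.push(observer);
    }

    /// Fan one event out to every registered observer.
    fn emit(&self, event_type: &str, content: &str) {
        let event = AgentEvent {
            event_type: event_type.to_string(),
            content: content.to_string(),
        };
        for observer in &self.observers {
            observer.on_event(&event);
        }
    }

    /// One pretend step: announce the action, then its result.
    fn run_step(&self, task: &str) {
        self.emit("action", task);
        self.emit("result", &format!("done: {}", task));
    }
}
```

In a test you subscribe a RecordingObserver, run a step, and assert on what it saw. In production you would subscribe the persistence observer instead, and the agent code stays the same.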

What About the Context Window?

Here's the subtlety: event sourcing gives you history, but you can't fit all that history into the LLM context. You need a strategy for summarizing and condensing:

Last N events — Keep the most recent for immediate context
Significant events — Mark certain events as "important" and always include
Periodic summaries — Summarize older events into compact summaries

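This is where the importance field in the memories table helps. Not all memories are equal. The events table has no such column yet, so marking significant events would need an equivalent flag there.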
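
The three strategies can be combined into a single context-building function. This is a sketch under a few assumptions: events arrive in chronological order, each carries an important flag (which the events table above does not have), and summarize is a placeholder for whatever produces the summary, most likely an LLM call. Event and build_context are hypothetical names.

```rust
/// Trimmed-down event with a hypothetical `important` flag.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    content: String,
    important: bool,
}

/// Build the lines to put into the LLM context.
///
/// Output order:
///   1. one summary line covering older, unimportant events (if any)
///   2. older events marked important, verbatim
///   3. the last `last_n` events, verbatim
///
/// `events` is assumed to be sorted oldest to newest.
fn build_context<F>(events: &[Event], last_n: usize, summarize: F) -> Vec<String>
where
    F: Fn(&[&Event]) -> String,
{
    // Everything before `split` counts as "older".
    let split = events.len().saturating_sub(last_n);
    let (older, recent) = events.split_at(split);

    // Important older events are kept as-is; the rest get condensed.
    let (important, routine): (Vec<&Event>, Vec<&Event>) =
        older.iter().partition(|e| e.important);

    let mut context = Vec::new();
    if !routine.is_empty() {
        context.push(summarize(&routine[..]));
    }
    context.extend(important.iter().map(|e| e.content.clone()));
    context.extend(recent.iter().map(|e| e.content.clone()));
    context
}
```

Passing the summarizer in as a closure keeps the selection logic testable without a model in the loop. In a real agent you would probably cache summaries rather than regenerate them on every turn.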

The Real Trade-off

Persistence isn't free. Every synchronous write to SQLite adds latency to the step that triggered it, and every stored event costs tokens if it ends up back in the context.

The question isn't "should I persist?" — it's "what granularity of persistence do I need?"

For short-lived agents: save checkpoints every N steps. For long-running agents: event sourcing with periodic summarization. For critical systems: add redundancy, maybe even multiple storage backends.
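
For the simplest of those options, checkpointing every N steps, a sketch might look like this. I'm assuming the agent state can be serialized to a string and that a plain file is good enough. Checkpointer, step, and load are hypothetical names, not part of ZeroClaw.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes the agent state to disk every `every` steps.
struct Checkpointer {
    path: PathBuf,
    every: usize,
    steps_since_save: usize,
}

impl Checkpointer {
    fn new(path: impl Into<PathBuf>, every: usize) -> Self {
        Checkpointer {
            path: path.into(),
            every: every.max(1), // 0 would mean "never"; treat it as 1
            steps_since_save: 0,
        }
    }

    /// Call once per agent step with the serialized state.
    /// Returns Ok(true) when a checkpoint was written on this step.
    fn step(&mut self, state: &str) -> io::Result<bool> {
        self.steps_since_save += 1;
        if self.steps_since_save < self.every {
            return Ok(false);
        }
        // Write to a temp file, then rename over the old checkpoint, so a
        // crash mid-write leaves the previous checkpoint intact.
        // Note: this does not fsync, so it guards against process crashes
        // more than against power loss.
        let tmp = self.path.with_extension("tmp");
        fs::write(&tmp, state)?;
        fs::rename(&tmp, &self.path)?;
        self.steps_since_save = 0;
        Ok(true)
    }

    /// Load the last checkpoint, or None if there isn't one yet.
    fn load(path: &Path) -> io::Result<Option<String>> {
        match fs::read_to_string(path) {
            Ok(contents) => Ok(Some(contents)),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }
}
```

The value of N is the granularity knob: with N = 3, a crash costs at most two steps of work, and you pay for one write every third step.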

What ZeroClaw Is Building

Right now, I'm at the decision point: do I add the events table to the existing brain.db, or create a separate database for observer persistence?

The pragmatic answer: same database, different table. The events are related to memories — they tell the story of how memories were created. Keeping them together makes replay easier.

The observer will live in its own module (src/observer/) with its own responsibility: capture events, store them, provide replay capability.

Three hours of work lost once is enough. The next crash should be just a minor inconvenience.