Your AI agent spent 20 minutes reasoning through a complex problem. It wrote code, ran tests, found bugs, fixed them. Then the process crashed.
The next time you start it, does it remember any of that?
For most agent implementations, the answer is no. Each fresh start is a blank slate. But it doesn't have to be that way. Here's how to build agents that actually survive restarts—and why it matters more than you think.
The Problem: Ephemeral State, Permanent Consequences
When an agent crashes mid-task, you lose more than just compute time. You lose:
- Context: What was it working on? What had it tried already?
- Progress: Which files did it modify? What decisions did it make?
- Relationships: What did it promise to other agents? What did it learn?
Traditional software can log errors and restart. Agents are different. Their "state" isn't just data—it's reasoning, preferences, and learned knowledge.
Pattern 1: The Checkpoint System
The simplest approach: periodically save your agent's complete state to disk.
```rust
use std::path::Path;
use serde::{Deserialize, Serialize};

// Full agent state, serialized wholesale on every checkpoint.
#[derive(Serialize, Deserialize)]
struct AgentState {
    memory: MemoryStore,
    active_tasks: Vec<Task>,
    conversation_history: Vec<Message>,
    tool_usage_log: Vec<ToolCall>,
}

fn save_checkpoint(state: &AgentState, path: &Path) -> Result<()> {
    let json = serde_json::to_string_pretty(state)?;
    // Note: a direct write isn't crash-safe. Production code should
    // write to a temp file and rename it into place atomically.
    std::fs::write(path, json)?;
    Ok(())
}
```
On restart, load the checkpoint first. If none exists, start fresh.
Pros: simple; complete state capture.
Cons: all-or-nothing; large state means slow saves.
Pattern 2: Event Sourcing
Instead of saving state, save every decision as an event. Replay events to reconstruct state.
```rust
use serde_json::Value;
use uuid::Uuid;

enum AgentEvent {
    TaskStarted { id: Uuid, description: String },
    ToolCalled { tool: String, args: Value, result: Value },
    DecisionMade { reasoning: String, choice: String },
    MemoryStored { key: String, value: String },
}
```
```rust
fn record_event(log: &mut Vec<AgentEvent>, event: AgentEvent) {
    log.push(event);
    // Optionally persist each event to SQLite or an append-only file.
}
```
To recover: replay the event log through a reducer that reconstructs state.
```rust
fn replay(events: &[AgentEvent]) -> AgentState {
    events.iter().fold(AgentState::default(), |state, event| {
        reduce(state, event)
    })
}
```
Pros: complete audit trail; replay from any point; time-travel debugging.
Cons: replay can be slow for long sessions; you need a snapshot strategy.
Pattern 3: Layered Persistence
Different data has different recovery requirements:
| Layer | Data | Recovery Strategy |
|-------|------|-------------------|
| Short-term | Current task, working memory | Checkpoint every N seconds |
| Medium-term | Conversation history | Event log with daily snapshots |
| Long-term | Learned facts, preferences | SQLite with FTS5 search |
This is the approach Memori, a persistent memory system written in Rust, takes: it combines FTS5 full-text search with semantic vectors, auto-deduplication at a 0.92 similarity threshold, and decay scoring with a 69-day half-life.
Pattern 4: The Handoff Protocol
When one agent hands off to another (or to itself after a restart), it needs a structured protocol:
```rust
struct Handoff {
    agent_id: Uuid,
    session_id: Uuid,
    last_completed_step: usize,           // resume from the next step
    pending_steps: Vec<Step>,             // work that still needs doing
    context_summary: String,              // condensed reasoning so far
    external_dependencies: Vec<Resource>, // files, APIs, other agents
}
```
This is what Submarinr uses. When your agent crashes, the handoff document tells the next instance exactly where to pick up.
Which Pattern Should You Use?
Start with checkpoints for simplicity. If you need:
- Audit trails → event sourcing
- Long-running sessions → layered persistence with snapshots
- Multi-agent workflows → handoff protocol
- Semantic search over history → FTS5 + vectors
The key insight: your agent's memory isn't one thing. It's layers with different characteristics, and each layer needs its own persistence strategy.
The ZeroClaw Approach
In my own agent (ZeroClaw), I'm moving toward a layered system:
- Events table in SQLite: every tool call, decision, and message as a row
- Memory layer: semantic search over stored facts
- Checkpoint snapshots: every 5 minutes, full state to JSON
The events table lets me answer questions like "what tools did this agent use most?" or "where did it fail last time?"—questions that checkpoint-only systems can't answer.
Why This Matters Now
We're entering the era of durable agents. Agents that run for hours, across restarts, maintaining context and relationships. The agents that win won't be the smartest—they'll be the ones that remember.
Build for persistence from day one. Your future self (and your agent) will thank you.