Your agent works perfectly in testing. Then it runs for three hours, processes 200 items, and crashes on the last one.
Now you've got 199 processed items, no way to resume, and a user asking where their data went.
This isn't a model problem. It's a runtime problem — and Rust already solved it.
What Most Agents Get Wrong
Most agent frameworks treat the agent as a single function:
```javascript
const result = await agent.run(query)
// If this fails, you start from zero
```
The agent is a black box. You send inputs, wait for outputs, and hope nothing breaks mid-flight. When it does break, you have no visibility into why, and no way to recover where it failed.
This is the fundamental flaw: agents are built like functions, but they behave like processes.
A function is all-or-nothing. A process can pause, crash, and resume. Agents need to be processes.
The Durable Execution Pattern
Inngest wrote about this as "durable execution" — the idea that your agent should be able to:
- Checkpoint its state after each meaningful step
- Resume from the last checkpoint when something fails
- Replay the exact context to debug failures
But here's what their article doesn't get into: Rust already has this built in.
Enter the Async Runtime
When you use Tokio, you're not just getting an async executor. You're getting a supervisor for asynchronous tasks with built-in:
- Task spawning and management — spawn a task, let it run, cancel it if needed
- Timeout and deadline handling — don't let an agent loop forever
- Panic handling — catch failures without crashing your whole system
- Structured concurrency — tasks are tied to a lifetime, not just floating around
This is exactly what durable agents need.
Checkpointing in Rust
Here's the pattern in Rust:
```rust
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct AgentState {
    processed: Vec<ProcessedItem>,
    current_step: usize,
    goal: String,
}

async fn process_with_checkpoint(
    items: Vec<Item>,
    state: Arc<Mutex<AgentState>>,
) -> Result<Vec<ProcessedItem>, Error> {
    let mut guard = state.lock().await;
    let start = guard.current_step;

    for (i, item) in items.iter().enumerate().skip(start) {
        let result = process_item(item).await?;
        guard.processed.push(result);
        guard.current_step = i + 1;

        // Checkpoint — persist state here
        save_checkpoint(&guard).await?;
    }

    Ok(guard.processed.clone())
}
```
If this crashes on item 47, you reload the checkpoint and `.skip(current_step)` resumes at item 47 — the first one that never completed. Not from zero.
Loop Detection
The Inngest article mentions agents looping forever as a common failure mode. In Tokio, you can wrap this with timeouts:
```rust
use tokio::time::{timeout, Duration};

async fn agent_with_timeout(
    query: &str,
    max_retries: u32,
) -> Result<Response, Error> {
    let mut attempts = 0;
    loop {
        match timeout(Duration::from_secs(30), agent_reason(query)).await {
            Ok(Ok(response)) => return Ok(response),
            Ok(Err(e)) => {
                attempts += 1;
                if attempts >= max_retries {
                    return Err(e);
                }
            }
            Err(_) => {
                // Timeout — the agent is looping
                attempts += 1;
                if attempts >= max_retries {
                    return Err(Error::TimeoutLoop);
                }
            }
        }
    }
}
```
Now your agent can't burn $200/hour in an infinite loop. It times out, fails fast, and you can see exactly where it got stuck.
Context Snapshots
One of Inngest's key recommendations: snapshot your context before every model call. In Rust, this pairs beautifully with serde:
```rust
use chrono::{DateTime, Utc};
use serde::Serialize;

#[derive(Serialize)]
struct ContextSnapshot {
    system_prompt: String,
    tools: Vec<ToolDefinition>,
    memory: Vec<MemoryEntry>,
    retrieved_data: Vec<RetrievedDoc>,
    timestamp: DateTime<Utc>,
    input_hash: String,
}

async fn call_with_snapshot(
    llm: &LlmClient,
    context: &AgentContext,
    prompt: &str,
) -> Result<Response, Error> {
    let snapshot = ContextSnapshot {
        system_prompt: context.system_prompt.clone(),
        tools: context.tools.clone(),
        memory: context.memory.clone(),
        retrieved_data: context.retrieved.clone(),
        timestamp: Utc::now(),
        input_hash: hash(prompt),
    };

    // Store snapshot before calling — enables replay debugging
    store_snapshot(&snapshot).await;

    let result = llm.infer(context, prompt).await?;

    // Store result alongside snapshot
    store_result(&snapshot, &result).await;

    Ok(result)
}
```
Now when your agent fails, you can replay the exact context and see what changed.
Why Tokio Over Smol?
I wrote before about Tokio being a mini operating system. For durable agents, this matters:
- Task management — spawn, join, cancel — built-in
- Timeouts — `tokio::time` is first-party, not a separate crate
- Tracing — the `tracing` crate, maintained by the Tokio project, for observability
- Maturity — battle-tested in production by thousands of companies
Smol is leaner and great for embedded or resource-constrained environments. But for agents that need to run for hours, recover from crashes, and provide observability? Tokio's the runtime that already does this.
The Runtime Is the Foundation
You can build the smartest agent with the best prompts and the best tools. But if it crashes and forgets, it's useless in production.
The runtime isn't an implementation detail. It's the difference between:
- A function that runs once or fails completely
- A process that can pause, fail, and resume
Tokio gives you the second one. That's why your agent needs a runtime — not just for async, but for durability.
If you want to dive deeper into Tokio's internals, I wrote about how it acts like a mini OS. And if you're debugging an agent that's losing state, check out why agents lose their memory.