Your agent works perfectly in testing. Then it runs for three hours, processes 200 items, and crashes on the last one.
Now you've got 199 processed items, no way to resume, and a user asking where their data went.
This isn't a model problem. It's a runtime problem — and Rust already solved it.
What Most Agents Get Wrong
Most agent frameworks treat the agent as a single function:
```javascript
const result = await agent.run(query)
// If this fails, you start from zero
```
The agent is a black box. You send inputs, wait for outputs, and hope nothing breaks mid-flight. When it does break, you have no visibility into why, and no way to recover where it failed.
This is the fundamental flaw: agents are built like functions, but they behave like processes.
A function is all-or-nothing. A process can pause, crash, and resume. Agents need to be processes.
The Durable Execution Pattern
Inngest wrote about this as "durable execution" — the idea that your agent should be able to:
- Checkpoint its state after each meaningful step
- Resume from the last checkpoint when something fails
- Replay the exact context to debug failures
But here's what their article doesn't get into: Rust already has this built in.
Enter the Async Runtime
When you use Tokio, you're not just getting an async executor. You're getting a supervisor for asynchronous tasks with built-in:
- Task spawning and management — spawn a task, let it run, cancel it if needed
- Timeout and deadline handling — don't let an agent loop forever
- Panic handling — catch failures without crashing your whole system
- Structured concurrency — tasks are tied to a lifetime, not just floating around
This is exactly what durable agents need.
Checkpointing in Rust
Here's the pattern in Rust:
```rust
use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct AgentState {
    processed: Vec<ProcessedItem>,
    current_step: usize,
    goal: String,
}

async fn process_with_checkpoint(
    items: Vec<Item>,
    state: Arc<Mutex<AgentState>>,
) -> Result<Vec<ProcessedItem>, Error> {
    let mut guard = state.lock().await;
    let start = guard.current_step;

    for (i, item) in items.iter().enumerate().skip(start) {
        let result = process_item(item).await?;
        guard.processed.push(result);
        guard.current_step = i + 1;

        // Checkpoint — persist state here
        save_checkpoint(&guard).await?;
    }

    Ok(guard.processed.clone())
}
```
If this crashes on item 47, you reload the checkpoint and `.skip(current_step)` resumes at item 47 — the first one that never completed. Not from zero.
Loop Detection
The Inngest article mentions agents looping forever as a common failure mode. In Tokio, you can wrap this with timeouts:
```rust
use tokio::time::{timeout, Duration};

async fn agent_with_timeout(
    query: &str,
    max_retries: u32,
) -> Result<Response, Error> {
    let mut attempts = 0;
    loop {
        match timeout(Duration::from_secs(30), agent_reason(query)).await {
            Ok(Ok(response)) => return Ok(response),
            Ok(Err(e)) => {
                attempts += 1;
                if attempts >= max_retries {
                    return Err(e);
                }
            }
            Err(_) => {
                // Timeout — the agent is looping
                attempts += 1;
                if attempts >= max_retries {
                    return Err(Error::TimeoutLoop);
                }
            }
        }
    }
}
```
Now your agent can't burn $200/hour in an infinite loop. It times out, fails fast, and you can see exactly where it got stuck.
Context Snapshots
One of Inngest's key recommendations: snapshot your context before every model call. In Rust, this pairs beautifully with serde:
```rust
use chrono::{DateTime, Utc};
use serde::Serialize;

#[derive(Serialize)]
struct ContextSnapshot {
    system_prompt: String,
    tools: Vec<ToolDefinition>,
    memory: Vec<MemoryEntry>,
    retrieved_data: Vec<RetrievedDoc>,
    timestamp: DateTime<Utc>,
    input_hash: String,
}

async fn call_with_snapshot(
    llm: &LlmClient,
    context: &AgentContext,
    prompt: &str,
) -> Result<Response, Error> {
    let snapshot = ContextSnapshot {
        system_prompt: context.system_prompt.clone(),
        tools: context.tools.clone(),
        memory: context.memory.clone(),
        retrieved_data: context.retrieved.clone(),
        timestamp: Utc::now(),
        input_hash: hash(prompt),
    };

    // Store snapshot before calling — enables replay debugging
    store_snapshot(&snapshot).await;

    let result = llm.infer(context, prompt).await?;

    // Store result alongside snapshot
    store_result(&snapshot, &result).await;

    Ok(result)
}
```
Now when your agent fails, you can replay the exact context and see what changed.
Why Tokio Over Smol?
I wrote before about Tokio being a mini operating system. For durable agents, this matters:
- Task management — spawn, join, cancel — built-in
- Timeouts — `tokio::time` is first-party, not a separate crate
- Tracing — the `tracing` crate, maintained by the Tokio project, for observability
- Maturity — battle-tested in production by thousands of companies
Smol is leaner and great for embedded or resource-constrained environments. But for agents that need to run for hours, recover from crashes, and provide observability? Tokio's the runtime that already does this.
The Runtime Is the Foundation
You can build the smartest agent with the best prompts and the best tools. But if it crashes and forgets, it's useless in production.
The runtime isn't an implementation detail. It's the difference between:
- A function that runs once or fails completely
- A process that can pause, fail, and resume
Tokio gives you the second one. That's why your agent needs a runtime — not just for async, but for durability.
If you want to dive deeper into Tokio's internals, I wrote about how it acts like a mini OS. And if you're debugging an agent that's losing state, check out why agents lose their memory.