Traditional testing assumes something beautiful: given the same input, code produces the same output. You run the test again, you get the same result. Your CI pipeline greenlights, you ship.
But what happens when your "unit" under test is an AI agent? When the output changes even with the same prompt? When your property-based test accidentally generates a 2GB string and crashes your CI runner?
This is the testing crisis of modern software. Not because our tools are bad — Rust's test ecosystem is excellent — but because the problems have changed. We're building systems that are inherently non-deterministic, and our testing strategies haven't caught up.
The Three Flavors of Non-Determinism
Before we fix anything, we need to name what we're actually fighting:
- Output non-determinism — Same input, different outputs. LLMs with temperature > 0, random initialization, Monte Carlo tree search.
- Timing non-determinism — Same input, same output, different timing. Async code, network calls, concurrent operations.
- Input explosion — The input space is so large you can't enumerate it. Property-based testing hits this hard.
Each requires different strategies. Mixing them up is where most teams get stuck.
Output Non-Determinism: Make Peace with Probability
Your AI agent doesn't always return the same answer. It shouldn't — that's the whole point of temperature. But testing "sometimes correct" is painful.
Here's what actually works:
1. Seeded Generation
If determinism matters for a test, fix the seed. Most LLM APIs let you set a seed:
// Test that respects the API contract
#[tokio::test]
async fn test_agent_planning_consistency() {
    let response = generate_with_seed(
        "What's the best way to split this task?",
        42,  // seed
        0.0, // temperature: greedy decoding, the most deterministic setting
    ).await;
    assert!(response.contains("step"));
}
Key insight: Temperature=0 doesn't mean "no randomness" — it means the model picks its most probable token at each step. It's more deterministic, but still can vary across model versions.
2. Behavioral Assertions Over Output Assertions
Don't test what the agent outputs. Test what it does:
#[tokio::test]
async fn test_agent_calls_correct_tools() {
    let (mut agent, spy) = Agent::with_spies();
    agent.execute("Transfer $100 to account 12345").await;
    // We don't care what it said — we care what it did
    let calls = spy.get_tool_calls();
    assert!(calls.iter().any(|c|
        c.name == "transfer" &&
        c.params.get("amount").and_then(|v| v.as_i64()) == Some(100)
    ));
}
This is trace-based testing — you assert on the execution trace, not the final output. Much more stable.
3. Snapshot Testing with Intentional Variance
For responses that legitimately vary, take multiple snapshots and accept any of them:
#[test]
fn test_agent_response_formats() {
    let responses: Vec<String> = (0..5).map(|_| generate_response("Hi")).collect();
    // At least one should be a greeting
    assert!(responses.iter().any(|r| r.to_lowercase().starts_with("hi")));
    // Length should be reasonable (any of the 5 should pass this)
    assert!(responses.iter().any(|r| r.len() < 500));
}
Timing Non-Determinism: Control the Clock
Async code, concurrency, and network calls introduce timing bugs that only appear under load. Rust gives you tools — use them.
1. Virtual Time in Tokio
Tokio's time::pause() lets you control time in tests:
#[tokio::test]
async fn test_timeout_behavior() {
    tokio::time::pause(); // Freeze the clock; it auto-advances when all tasks are idle
    let result = tokio::time::timeout(
        Duration::from_secs(5),
        slow_operation()
    ).await;
    assert!(result.is_err()); // Should time out (instant in virtual time)
}
No sleeping in your tests. No flaky timeouts.
2. Deterministic Concurrency with Loom
For data race bugs, Loom exhaustively explores possible interleavings:
use loom::sync::{Arc, Mutex};
use loom::thread;

#[test]
fn test_concurrent_mutations() {
    loom::model(|| {
        // Use loom's shadow types, not std's, so the model can track accesses
        let state = Arc::new(Mutex::new(0));
        let t1 = thread::spawn({
            let s = Arc::clone(&state);
            move || *s.lock().unwrap() += 1
        });
        let t2 = thread::spawn({
            let s = Arc::clone(&state);
            move || *s.lock().unwrap() += 1
        });
        t1.join().unwrap();
        t2.join().unwrap();
        assert_eq!(*state.lock().unwrap(), 2);
    });
}
Loom runs your code multiple times with different thread schedules, catching bugs that would only appear once in a million runs.
3. Integration Tests with Controlled Environments
For timing-sensitive integration tests, containerize:
#[tokio::test]
async fn test_service_health_check() {
    // Use testcontainers to spin up a real database
    // with predictable timing
    let db = Container::postgres()
        .with_timeout(Duration::from_secs(1))
        .start();
    let service = MyService::connect(db.url());
    assert!(service.is_healthy().await);
}
Input Explosion: Tame the Property Space
Property-based testing generates inputs to find edge cases humans miss. But the naive approach crashes.
1. Size Limits
Proptest lets you bound input size:
proptest! {
    #[test]
    fn test_string_processing(s in "[a-z]{0,100}") {
        // Won't generate strings longer than 100 chars
        let result = process(&s);
        assert!(result.is_ok() || !s.is_empty());
    }
}
Rule of thumb: Start small. 10-100 iterations. Only scale up when you're confident.
2. Custom Strategies
For domain-specific inputs, define your own:
use proptest::prelude::*;

prop_compose! {
    fn valid_email()(
        local in "[a-z]{1,20}",
        domain in "[a-z]{1,10}\\.[a-z]{2,4}"
    ) -> String {
        format!("{}@{}", local, domain)
    }
}

proptest! {
    #[test]
    fn test_email_validation(email in valid_email()) {
        assert!(validate_email(&email).is_ok());
    }
}
3. Shrinking
When proptest finds a failing case, it "shrinks" it to find the smallest reproducible example:
test src/lib.rs - test_processing ... FAILED
Test failed; shrinking input: "[very long string that triggers the bug]"
minimal failing input: "a"
This is magic. Let it work — don't interrupt shrinking.
The Agent Testing Framework
For AI agents specifically, here's a pattern that scales:
1. Define the Action Space
enum AgentAction {
    CallTool { name: String, params: HashMap<String, Value> },
    Respond { content: String },
    AskClarification { question: String },
}
2. Capture the Trace
struct AgentTrace {
    actions: Vec<AgentAction>,
    reasoning: Vec<String>, // LLM "thoughts"
}

impl Agent {
    async fn execute(&mut self, prompt: &str) -> AgentTrace {
        let mut trace = AgentTrace { actions: vec![], reasoning: vec![] };
        loop {
            let decision = self.decide(prompt, &trace).await;
            match decision {
                Decision::Act(action) => {
                    trace.actions.push(action);
                    // Execute the action and feed the result back
                }
                Decision::Respond(text) => {
                    trace.actions.push(AgentAction::Respond { content: text });
                    break;
                }
            }
        }
        trace
    }
}
3. Assert on Trace Properties
#[tokio::test]
async fn test_agent_asks_before_destructive_action() {
    let mut agent = Agent::new();
    let trace = agent.execute("Delete all my data").await;
    // Did it ask for clarification before doing something destructive?
    let asked_first = trace.actions.iter()
        .position(|a| matches!(a, AgentAction::AskClarification { .. }));
    let destructive = trace.actions.iter()
        .position(|a| matches!(a, AgentAction::CallTool { name, .. } if name == "delete"));
    assert!(
        asked_first.is_some() &&
        destructive.is_some() &&
        asked_first < destructive,
        "Agent must ask before deleting"
    );
}
This is intent-based testing — you verify the agent's behavior aligns with principles, not specific outputs.
What Doesn't Work
A few things I've seen teams try that always fail:
- Golden file testing — Compare LLM output to a "correct" file. Fails because outputs change.
- Exact string assertions — Same reason. Use contains, regex, or structural checks.
- 100% code coverage — Doesn't matter if your agent takes a bad action that's "covered" by tests.
- Flaky test retries — Hides real bugs. Fix the race condition, don't retry around it.
The Honest Truth
You cannot test AI agents the way you test factorial functions. The input space is infinite, the outputs are probabilistic, and "correct" is context-dependent.
What you can do is:
- Control what you can — seeds, temperature, deterministic subsets
- Assert on behavior, not output — trace-based, intent-based testing
- Embrace statistical testing — 95% success rate over 100 runs might be your ceiling
- Test your invariants — "Agent never deletes without confirming" is testable even if the response varies
The future of testing isn't more assertions. It's designing systems that are testable by design — where you can capture traces, define action spaces, and verify properties instead of outputs.
That's the real shift: from "does this produce the right string?" to "does this follow the right process?"