There's a myth in Rust circles that "zero-cost abstractions" mean you can write clean code and performance will follow. The Turbopuffer team just proved that wrong in spectacular fashion.

They were debugging high latency on a full-text search query. The code looked clean — iterator chains, filter operations, the kind of Rust that makes you feel smart. The problem: 220ms for a filtered BM25 query across 5 million documents.

The culprit? Iterators preventing SIMD.

The Abstraction That Cost 173ms

Here's what happened. They had something like:

documents
    .iter()
    .filter(|doc| permissions.contains(&doc.id))
    .map(|doc| score(doc, &query))
    .filter(|&s| s > threshold)
    .collect::<Vec<_>>()

Each .filter() and .map() wraps the previous iterator in a new adapter. The filters introduce data-dependent branches, so LLVM can't prove that consecutive iterations are independent, and it won't vectorize the loop. The code is "clean," but it pushes one element at a time through a chain of closures.
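To make that concrete, here's a toy (not Turbopuffer's code; the functions and data are made up for illustration). The same reduction in three shapes: a pure map+sum that LLVM routinely vectorizes, a filtered version whose per-element branch is the shape that commonly defeats autovectorization, and a branchless rewrite that turns the branch back into arithmetic:

```rust
// Toy illustration only. A straight-line map+sum: no branches,
// a shape the autovectorizer handles well.
fn sum_mapped(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * 2.0).sum()
}

// The same reduction gated behind a data-dependent branch per element.
// This is the structural pattern that tends to block vectorization.
fn sum_filtered(xs: &[f32], mask: &[bool]) -> f32 {
    xs.iter()
        .zip(mask.iter())
        .filter(|&(_, &keep)| keep) // branch per element
        .map(|(x, _)| x * 2.0)
        .sum()
}

// Branchless rewrite: multiply by 0.0 or 1.0 instead of branching,
// restoring straight-line arithmetic.
fn sum_filtered_branchless(xs: &[f32], mask: &[bool]) -> f32 {
    xs.iter()
        .zip(mask.iter())
        .map(|(x, &keep)| x * 2.0 * (keep as u8 as f32))
        .sum()
}

fn main() {
    let xs = [1.0f32, 2.0, 3.0, 4.0];
    let mask = [true, false, true, true];
    println!("{} {} {}",
        sum_mapped(&xs),
        sum_filtered(&xs, &mask),
        sum_filtered_branchless(&xs, &mask));
}
```

All three compute the same logic; only the control-flow shape differs, and that shape is what the vectorizer reacts to.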

Their fix: rewrite the chain as a single explicit loop, a shape the vectorizer can work with.

for i in 0..documents.len() {
    if permissions.contains(&documents[i].id) {
        let s = score(&documents[i], &query);
        if s > threshold {
            results.push(s);
        }
    }
}

Now LLVM can see the independence. Same logic, different shape. 220ms → 47ms. A 4.7x speedup.
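"Same logic, different shape" is easy to verify. Here's a self-contained sketch with stand-in types (`Doc` and `score` are hypothetical placeholders, not Turbopuffer's real definitions) showing that the two shapes produce identical results:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for the post's document type and scorer.
struct Doc { id: u64, tf: f32 }

fn score(doc: &Doc, query_weight: f32) -> f32 {
    doc.tf * query_weight // placeholder for a real BM25 score
}

// The "clean" shape: adapter chain.
fn search_chained(docs: &[Doc], perms: &HashSet<u64>, qw: f32, threshold: f32) -> Vec<f32> {
    docs.iter()
        .filter(|d| perms.contains(&d.id))
        .map(|d| score(d, qw))
        .filter(|&s| s > threshold)
        .collect()
}

// The "fast" shape: one explicit loop over the slice.
fn search_looped(docs: &[Doc], perms: &HashSet<u64>, qw: f32, threshold: f32) -> Vec<f32> {
    let mut results = Vec::with_capacity(docs.len());
    for d in docs {
        if perms.contains(&d.id) {
            let s = score(d, qw);
            if s > threshold {
                results.push(s);
            }
        }
    }
    results
}

fn main() {
    let docs = vec![
        Doc { id: 1, tf: 1.0 },
        Doc { id: 2, tf: 2.0 },
        Doc { id: 3, tf: 5.0 },
    ];
    let perms = HashSet::from([2u64, 3]);
    assert_eq!(search_chained(&docs, &perms, 2.0, 3.0),
               search_looped(&docs, &perms, 2.0, 3.0));
}
```

The outputs match element for element; only the generated machine code differs.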

What "Zero-Cost" Actually Means

"Zero-cost" means the abstraction compiles to machine code equivalent to writing the same algorithm by hand. It doesn't mean the compiler will pick a better algorithm, reshape your loops, or vectorize code whose structure hides the independence it would need to prove.

The Rust iterator protocol has real per-element structure: a next() call per element (usually, but not always, inlined away), a data-dependent branch per filter closure, and control flow the autovectorizer often can't flatten. That's "zero cost" relative to hand-rolling the same abstraction in C++. It's not zero cost relative to what the CPU could do if you helped the compiler see the whole picture.
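One way to "help the compiler see the whole picture" is to hoist the branches out of the arithmetic entirely. A sketch, using hypothetical flat-array inputs rather than anything from the actual Turbopuffer codebase: score everything in a dense, branch-free pass, then filter the precomputed scores in a second, cheap pass. Note that the scoring pass can even stay an iterator; it's the filter branches, not iterators per se, that block vectorization.

```rust
// Hypothetical layout: term frequencies and a precomputed permission
// mask stored as parallel slices (structure-of-arrays).
fn search_two_pass(tfs: &[f32], allowed: &[bool], qw: f32, threshold: f32) -> Vec<f32> {
    assert_eq!(tfs.len(), allowed.len());

    // Pass 1: score every document. Straight-line math over a slice,
    // no branches or early exits, so LLVM can autovectorize it.
    let scores: Vec<f32> = tfs.iter().map(|tf| tf * qw).collect();

    // Pass 2: the branchy filtering, now decoupled from the math.
    let mut out = Vec::new();
    for i in 0..scores.len() {
        if allowed[i] && scores[i] > threshold {
            out.push(scores[i]);
        }
    }
    out
}

fn main() {
    let hits = search_two_pass(&[1.0, 2.0, 5.0], &[false, true, true], 2.0, 3.0);
    assert_eq!(hits, vec![4.0, 10.0]);
}
```

The trade-off is an extra buffer of scores in exchange for a hot loop that is pure arithmetic; whether that wins depends on how expensive the scoring is relative to the filtering, which is exactly the kind of question profiling answers.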

When This Matters

This isn't academic. It matters when you're scanning millions of elements in a hot path: search scoring, filtering, aggregation, or any tight numeric loop where per-element branches and closure calls swamp the actual arithmetic.

The Turbopuffer team calls this "mechanical sympathy" — understanding what your hardware actually does, then writing code that lets it do that. It's the difference between "idiomatic Rust" and "fast Rust."

The Takeaway

Write clean code first. Then profile. If your hot path is slower than expected, look at your iterator chains. Sometimes the fastest Rust is the least "Rust-y" — a raw for loop that lets LLVM do its job.

The "zero-cost" promise isn't a get-out-of-jail-free card. It's a starting point.