Your Rust code is clean. Your iterators chain beautifully. Your filter().map().collect() pipeline is a work of art.
It's also potentially 4x slower than it needs to be.
This isn't a contradiction. It's a class of performance bug that hides specifically in well-written, well-abstracted code, and understanding it requires going below the language, below the compiler, all the way down to how the CPU actually executes instructions.
The turbopuffer Discovery
The folks at turbopuffer (a vector database company) discovered this the hard way. They were profiling filtered full-text search queries and found that Rust's "zero-cost" iterators were silently preventing SIMD from kicking in.
For the uninitiated: SIMD (Single Instruction, Multiple Data) lets your CPU process multiple pieces of data in parallel with a single instruction. Think of it like a vectorized operation, but one that happens at the hardware level. When SIMD works, you can get 4x, 8x, or even 16x throughput improvements on the right workloads.
The problem? SIMD requires contiguous memory access. Your CPU loads a block of data (say, 256 bits at a time) and operates on all of it simultaneously.
Now look at your beautiful iterator chain:
data.iter()
    .filter(|x| predicate(x))
    .map(|x| transform(x))
    .collect::<Vec<_>>();
At the LLVM level, this generates a loop with branch conditions inside. Each iteration checks the predicate, possibly transforms, possibly stores. There's no contiguous block to vectorize. The CPU executes one element at a time, with branching logic scattered throughout.
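To make the contrast concrete, here is a minimal sketch (the function names and the even-number predicate are mine, not turbopuffer's). The first version filters with a branch per element; the second folds the predicate into arithmetic so the loop body has no branch, a shape LLVM can often auto-vectorize:

```rust
// Branchy version: filter() introduces a data-dependent branch
// inside the loop, which blocks auto-vectorization.
fn sum_evens_branchy(data: &[u32]) -> u32 {
    data.iter().filter(|&&x| x % 2 == 0).sum()
}

// Branchless version: every element contributes, masked to zero when
// the predicate fails. With no branch in the loop body, the compiler
// can often turn this into packed SIMD compares and adds.
fn sum_evens_branchless(data: &[u32]) -> u32 {
    data.iter().map(|&x| x * ((x % 2 == 0) as u32)).sum()
}

fn main() {
    let data: Vec<u32> = (0..100).collect();
    // Both versions compute the same result; only the generated code differs.
    assert_eq!(sum_evens_branchy(&data), sum_evens_branchless(&data));
    println!("{}", sum_evens_branchless(&data));
}
```

Whether the branchless form actually vectorizes depends on the target and optimization level, which is exactly why the "measure, don't guess" advice below matters.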
Zero-Cost Abstractions: The Promise
Rust is built on the idea of zero-cost abstractions. The concept, inherited from C++, means two things:
- You don't pay for abstractions you don't use
- Abstractions don't add runtime overhead compared to hand-written code
The key insight is that these abstractions should compile down to efficient machine code. A for loop and an iterator should produce identical assembly.
And here's where things get interesting: they usually do. The Rust compiler is remarkably good at optimizing iterator chains. In many cases, your clean code compiles to identical machine code as a hand-optimized version.
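A quick illustration of the happy path, where the abstraction really is free. With optimizations on, rustc typically compiles both of these to the same machine code:

```rust
// Explicit loop: manual accumulator, manual iteration.
fn sum_for(data: &[u64]) -> u64 {
    let mut acc = 0;
    for &x in data {
        acc += x;
    }
    acc
}

// Iterator version: same semantics, expressed through the abstraction.
fn sum_iter(data: &[u64]) -> u64 {
    data.iter().sum()
}

fn main() {
    let data = [1u64, 2, 3, 4];
    assert_eq!(sum_for(&data), sum_iter(&data));
    println!("{}", sum_iter(&data));
}
```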
But "usually" isn't "always." And the edge cases are precisely the ones that matter for high-performance code.
When Abstractions Break Down
The issue emerges when you're dealing with:
- Branching inside loops - filter() generates conditional branches that prevent vectorization
- Non-contiguous access patterns - iterators that skip elements create irregular memory patterns
- Hidden heap allocations - some iterator combinators allocate even when they shouldn't
- Loop-carried dependencies - when each iteration depends on the previous one, SIMD can't help
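The last case is worth seeing side by side. This sketch (my own example, not from turbopuffer) contrasts independent iterations, which the compiler is free to process several at a time, with a loop-carried dependency that forces sequential execution:

```rust
// Independent iterations: each square depends only on its own element,
// so the compiler may process several elements per instruction.
fn sum_of_squares(data: &[i64]) -> i64 {
    data.iter().map(|&x| x * x).sum()
}

// Loop-carried dependency: each output depends on the previous output
// (a running maximum), so iterations cannot be done in parallel and
// auto-vectorization can't help.
fn running_max(data: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(data.len());
    let mut max = i64::MIN;
    for &x in data {
        max = max.max(x); // depends on the previous iteration's max
        out.push(max);
    }
    out
}

fn main() {
    assert_eq!(sum_of_squares(&[1, 2, 3]), 14);
    assert_eq!(running_max(&[3, 1, 4, 1, 5]), vec![3, 3, 4, 4, 5]);
    println!("ok");
}
```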
The turbopuffer team found that their clean, idiomatic filter operations were generating assembly that looked something like:
.loop:
cmp rdi, rsi
jae .done
movzx eax, byte [rdi]
test al, al
jz .skip
; ... actual work ...
.skip:
inc rdi
jmp .loop
Compare this to what SIMD-capable code looks like:
; Load 32 bytes at once
vmovdqu ymm0, [rdi]
vpcmpgtb ymm1, ymm0, ymm2
; ... vectorized operations ...
The difference is night and day. One processes a single byte per iteration. The other processes 32 bytes at once.
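You don't have to write assembly to aim for the second shape. Here is a branchless byte count (my example) that mirrors the vpcmpgtb idea: compare every byte against a threshold and accumulate the 0/1 results, with no branch for the vectorizer to trip over:

```rust
// Branchless byte count: (b > threshold) as usize is 0 or 1, so the
// loop body is a compare plus an add. LLVM commonly lowers this to
// packed byte compares, handling 16-32 bytes per step on x86-64.
fn count_gt(bytes: &[u8], threshold: u8) -> usize {
    bytes.iter().map(|&b| (b > threshold) as usize).sum()
}

fn main() {
    // 'l' (108) and 'o' (111) exceed 104; 'h' (104) and 'e' (101) don't.
    println!("{}", count_gt(b"hello", 104));
}
```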
The Practical Implications
So what do you actually do about this?
First, profile before you optimize. This issue only matters in hot loops. If your code isn't on the critical path, your clean abstractions are fine.
Second, understand your data. SIMD works best on uniform, predictable data. If you're filtering variable-length strings or complex structures, the gains may not be worth the complexity.
Third, know your alternatives:
- Manual loops - Sometimes just writing the for loop lets the compiler see opportunities it missed
- Rayon - For embarrassingly parallel workloads, rayon handles thread-level parallelism automatically, and its tight inner loops remain candidates for SIMD
- Explicit SIMD - When you need direct control, the std::arch intrinsics (often unsafe) or the nightly portable_simd API give it to you
- Compiler hints - #[inline] and #[inline(always)] can change optimization behavior
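A sketch of the first alternative (an integer dot product I chose for illustration): the explicit indexed loop gives LLVM an unobstructed view of the access pattern, and with integer arithmetic there are no reassociation concerns to block vectorization:

```rust
// Manual-loop rewrite: bounds are hoisted once via min(), and the body
// is a plain multiply-accumulate over contiguous slices, a pattern the
// auto-vectorizer handles well for integer types.
fn dot_i32(a: &[i32], b: &[i32]) -> i32 {
    let n = a.len().min(b.len());
    let mut acc = 0;
    for i in 0..n {
        acc += a[i] * b[i];
    }
    acc
}

fn main() {
    println!("{}", dot_i32(&[1, 2, 3], &[4, 5, 6]));
}
```

Note that the same rewrite with f32 may not vectorize by default, because reordering float additions changes the result and the compiler won't do it without explicit permission.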
Fourth, measure. Use cargo bench. Use Linux's perf. Look at the actual assembly with cargo-show-asm. Don't guess.
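For a quick sanity check before reaching for cargo bench, a crude timing harness using only the standard library works; std::hint::black_box keeps the optimizer from deleting the work being measured (the workload here is a placeholder of my choosing):

```rust
use std::hint::black_box;
use std::time::Instant;

// The workload under measurement: count odd values.
fn count_odd(data: &[u32]) -> u64 {
    data.iter().map(|&x| (x & 1) as u64).sum()
}

fn main() {
    let data: Vec<u32> = (0..1_000_000).collect();
    let start = Instant::now();
    let mut total = 0u64;
    for _ in 0..100 {
        // black_box hides the input from the optimizer so the loop
        // isn't folded into a constant or hoisted out entirely.
        total += count_odd(black_box(&data));
    }
    black_box(total);
    println!("100 runs: {:?} (checksum {})", start.elapsed(), total);
}
```

This is only a smoke test; for real numbers, use cargo bench or criterion, which handle warm-up and statistical noise for you.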
What This Teaches Us
The real lesson here isn't "avoid iterators." Iterators are fine most of the time.
The lesson is that performance model ≠ language model. When you write JavaScript or Python, you expect some runtime cost. When you write Rust, you expect zero-cost abstractions, and you mostly get them. But "mostly" isn't "always," and the exceptions hide in the most beautiful code.
This is why high-performance systems programming remains hard. It's not that the tools are inadequate; it's that the hardware has quirks that no abstraction can fully hide. The CPU doesn't know about your elegant domain model. It knows about bytes and branches and vector registers.
Your job, as a systems programmer, is to understand bothāto write clean code for humans, and to peek under the hood when it matters.
The best Rust code isn't the most clever. It's the code that knows when to be clean, and when to get dirty.