What if I told you that adding 1,000 numbers together doesn't require 1,000 CPU instructions?
It doesn't. A modern CPU can add 8 integers at once. Or 16. Or 32, if you're on the right architecture. That's SIMD: Single Instruction, Multiple Data. And Rust's std::simd makes it accessible.
The Problem: Wasting 7/8 of Your CPU
Here's a loop you've written a thousand times:
fn sum(values: &[f32]) -> f32 {
    let mut total = 0.0;
    for v in values {
        total += v;
    }
    total
}
This is correct. It's readable. It's also potentially slow — you're using one CPU lane when you could be using eight.
The compiler might auto-vectorize this. It often does. But "might" isn't a strategy. Sometimes the aliasing analysis is too conservative. Sometimes the loop is just complex enough to confuse the optimizer. And sometimes you need explicit control because you're doing something the compiler can't see.
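One way to tip the odds in the auto-vectorizer's favor, in stable Rust, is to break the serial dependency chain yourself. Floating-point addition isn't associative, so the compiler won't reorder total += v into parallel lanes on its own; giving it independent accumulators removes that obstacle. A minimal sketch (the function name is mine):

```rust
// Stable Rust, no std::simd. Four independent accumulators break the
// one-long-chain data dependency of `total += v`, which is what
// usually blocks vectorization of float reductions.
fn sum_unrolled(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = values.chunks_exact(4);
    // The final 0-3 elements don't fill a chunk; sum them separately
    let remainder: f32 = chunks.remainder().iter().sum();
    for chunk in chunks {
        acc[0] += chunk[0];
        acc[1] += chunk[1];
        acc[2] += chunk[2];
        acc[3] += chunk[3];
    }
    acc[0] + acc[1] + acc[2] + acc[3] + remainder
}
```

Note that this changes summation order relative to the plain loop, so the last bits of the result can differ — the same caveat applies to any SIMD reduction.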
That's where std::simd comes in.
The Solution: Explicit Vectorization
Rust's standard library includes std::simd, currently nightly-only behind the portable_simd feature. It gives you Simd<T, N>, a vector type that looks like an array but operates like a single value:
// Nightly-only: needs #![feature(portable_simd)] at the crate root
use std::simd::f32x4;
use std::simd::num::SimdFloat; // for reduce_sum (this trait's path has moved between nightlies)

fn sum_simd(values: &[f32]) -> f32 {
    let mut total = f32x4::splat(0.0);
    // Process 4 elements at a time; chunks_exact never yields a short
    // chunk, so the indexing below can't go out of bounds
    let chunks = values.chunks_exact(4);
    // The final 0-3 elements don't fill a vector; sum them as scalars
    let remainder: f32 = chunks.remainder().iter().sum();
    for chunk in chunks {
        let v = f32x4::from_array([chunk[0], chunk[1], chunk[2], chunk[3]]);
        total += v;
    }
    // Horizontal sum: add all lanes together, then the leftovers
    total.reduce_sum() + remainder
}
Four additions become one. On AVX-512 hardware, you could do 16 at once.
When to Use It
Honest answer: probably not often. Here's why:
- Auto-vectorization is good. For most loops, the compiler does the right thing, and adding explicit SIMD often doesn't change performance.
- It's a readability trade-off. The readable version usually wins until you've profiled and proven the hot path needs help.
- Portable SIMD has fixed widths. f32x4 works everywhere, but your CPU might support f32x8 or f32x16, and you can't easily adapt at runtime.
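The width problem can be worked around with a const-generic kernel plus runtime feature detection. The sketch below is stable Rust standing in for Simd<f32, N> (the function names are mine); on nightly, the kernel's inner loop would become an actual vector operation:

```rust
// Runtime dispatch sketch: pick a lane count based on detected CPU
// features, then call a width-generic kernel.
fn sum_dispatch(values: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return sum_lanes::<16>(values);
        }
        if is_x86_feature_detected!("avx2") {
            return sum_lanes::<8>(values);
        }
    }
    // Baseline width for older x86 or non-x86 targets
    sum_lanes::<4>(values)
}

// Width-generic kernel: N plays the role of the lane count in Simd<f32, N>
fn sum_lanes<const N: usize>(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; N];
    let chunks = values.chunks_exact(N);
    let tail: f32 = chunks.remainder().iter().sum();
    for chunk in chunks {
        for lane in 0..N {
            acc[lane] += chunk[lane];
        }
    }
    acc.iter().sum::<f32>() + tail
}
```

The detection cost is paid once per call here; in real code you'd cache the choice, but the shape is the point.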
So when does it make sense?
- Image/pixel processing — per-pixel operations that are obviously parallel
- Game physics — large arrays of positions, velocities
- Audio processing — sample-by-sample transforms
- Scientific computing — matrix operations, FFTs, anything with tight inner loops
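To make the audio case concrete: applying a gain touches every sample with the same multiply and no dependency between samples, which is exactly the shape that vectorizes well. A minimal stable-Rust sketch (my naming):

```rust
// One multiply (plus a clamp against clipping) per sample, with no
// dependency between samples: every lane can work independently.
fn apply_gain(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

Whether the compiler or std::simd does the slicing, every lane here has real work.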
A Real Example: Grayscale Conversion
Here's something I needed for rust-sketch — converting RGB pixels to grayscale:
// Naive: one pixel at a time
fn grayscale_naive(pixels: &mut [u8]) {
    // chunks_exact_mut(3) walks the buffer one RGB pixel at a time
    // and skips a trailing partial pixel instead of panicking
    for px in pixels.chunks_exact_mut(3) {
        let r = px[0] as f32;
        let g = px[1] as f32;
        let b = px[2] as f32;
        let gray = (0.299 * r + 0.587 * g + 0.114 * b) as u8;
        px[0] = gray;
        px[1] = gray;
        px[2] = gray;
    }
}
// SIMD: process 4 RGB pixels (12 bytes) per iteration (nightly-only)
use std::simd::u8x16;

fn grayscale_simd(pixels: &mut [u8]) {
    let mut i = 0;
    // We consume 12 bytes per step but load 16, so the loop bound
    // has to keep a full 16-byte load in range
    while i + 16 <= pixels.len() {
        let rgb = u8x16::from_slice(&pixels[i..i + 16]);
        // The channels are interleaved: R at offsets 0,3,6,9,
        // G at 1,4,7,10, B at 2,5,8,11. Splitting them into separate
        // vectors takes shuffle (swizzle) operations - showing the
        // concept only
        // ...
        let _ = rgb;
        i += 12;
    }
    // Handle the remaining pixels with the scalar version
}
The real code is more involved (extracting every third byte from a SIMD vector requires shuffle operations). But the principle is clear: instead of three scalar multiplications per pixel, you do three vector multiplications per four pixels.
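To show the whole structure without the shuffles, here is a portable stand-in in stable Rust: it gathers each block of four pixels into per-channel arrays (where the f32x4 lanes would live), does the weighted sum across the block, and falls back to scalar code for the tail. This is a sketch of the idea, not the code from rust-sketch:

```rust
// Portable stand-in for the SIMD version: process 4 RGB pixels
// (12 bytes) per iteration. The arrays of four channel values stand
// in for f32x4 lanes; the per-array arithmetic is what the vector
// multiply-adds would do.
fn grayscale_blocks(pixels: &mut [u8]) {
    let mut i = 0;
    while i + 12 <= pixels.len() {
        let mut r = [0.0f32; 4];
        let mut g = [0.0f32; 4];
        let mut b = [0.0f32; 4];
        // De-interleave: pixel p's channels live at i + 3*p + {0,1,2}
        for p in 0..4 {
            r[p] = pixels[i + 3 * p] as f32;
            g[p] = pixels[i + 3 * p + 1] as f32;
            b[p] = pixels[i + 3 * p + 2] as f32;
        }
        for p in 0..4 {
            let gray = (0.299 * r[p] + 0.587 * g[p] + 0.114 * b[p]) as u8;
            pixels[i + 3 * p] = gray;
            pixels[i + 3 * p + 1] = gray;
            pixels[i + 3 * p + 2] = gray;
        }
        i += 12;
    }
    // Remainder: fewer than 4 pixels left, one at a time
    while i + 3 <= pixels.len() {
        let r = pixels[i] as f32;
        let g = pixels[i + 1] as f32;
        let b = pixels[i + 2] as f32;
        let gray = (0.299 * r + 0.587 * g + 0.114 * b) as u8;
        pixels[i] = gray;
        pixels[i + 1] = gray;
        pixels[i + 2] = gray;
        i += 3;
    }
}
```

Swap the inner arrays for f32x4 and the de-interleave loop for swizzles, and you have the nightly version.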
The Bigger Picture
SIMD is one tool in a larger toolbox:
- First, measure. Profile before optimizing. cargo bench is your friend.
- Then, check auto-vectorization. Build with RUSTFLAGS="-C target-cpu=native" so the compiler can use your actual CPU's features, and look at the assembly with cargo-show-asm.
- Then, consider explicit SIMD. Reach for it only when the compiler isn't vectorizing and you've proven the loop is a bottleneck.
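"Measure first" doesn't have to mean setting up a harness. A quick std-only timer is enough to spot a 4x difference before you commit to a real benchmark (the helper below is my sketch; for publishable numbers use cargo bench or criterion):

```rust
use std::time::Instant;

// Quick-and-dirty timing: good enough to spot a large difference,
// not a substitute for a proper benchmark harness.
fn time_it<F: FnMut() -> f32>(label: &str, mut f: F) -> f32 {
    let start = Instant::now();
    let result = f();
    println!("{label}: {:?}", start.elapsed());
    result
}

fn sum_scalar(values: &[f32]) -> f32 {
    values.iter().sum()
}
```

Run it in release mode; debug-build timings will mislead you.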
Remember: the goal isn't SIMD. The goal is performance. And readable, idiomatic Rust that happens to be fast usually beats clever vectorization that nobody can maintain.
What's Next
This is part of a thread on Data-Oriented Design in Rust. We covered:
- Structure of Arrays vs Array of Structures
- Cache locality and you
- ECS patterns in game engines
- And now: explicit vectorization with SIMD
Coming next: bytemuck for zero-cost casting, and how to build truly contiguous data layouts that the compiler loves.
SIMD is fast. But the fastest code is code you don't need to run. That's the real lesson from DOD: arrange your data right, and the hardware does the rest.