What if I told you that adding 1,000 numbers together doesn't require 1,000 CPU instructions?

It doesn't. A modern CPU can add 8 integers at once. Or 16. Or 32, if you're on the right architecture. That's SIMD: Single Instruction, Multiple Data. And Rust's std::simd makes it accessible.

The Problem: Wasting 7/8 of Your CPU

Here's a loop you've written a thousand times:

fn sum(values: &[f32]) -> f32 {
    let mut total = 0.0;
    for v in values {
        total += v;
    }
    total
}

This is correct. It's readable. It's also potentially slow — you're using one CPU lane when you could be using eight.

The compiler might auto-vectorize this. It often does. But "might" isn't a strategy. Sometimes the aliasing analysis is too conservative. Sometimes the loop is just complex enough to confuse the optimizer. And sometimes you need explicit control because you're doing something the compiler can't see.
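When the blocker is the strict left-to-right order of float additions, you can sometimes unlock auto-vectorization on stable Rust just by restructuring the loop into independent accumulators. A sketch (the function name is illustrative, not from the original):

```rust
// A stable-Rust sketch: four independent accumulators remove the
// single loop-carried dependency that blocks vectorizing float sums.
fn sum_unrolled(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = values.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        for lane in 0..4 {
            acc[lane] += chunk[lane];
        }
    }
    // Combine the four partial sums, then the leftover elements
    acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
}
```

Splitting the sum across four accumulators reorders the float additions (the exact reordering the compiler isn't allowed to do on its own), which can let it emit vector adds without any SIMD types in your source.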

That's where std::simd comes in.

The Solution: Explicit Vectorization

Rust's std::simd (portable SIMD) gives you Simd&lt;T, N&gt; — a vector type that looks like an array but operates like a single value. Note that it is still nightly-only as of this writing, behind the `portable_simd` feature:

#![feature(portable_simd)] // std::simd is nightly-only

use std::simd::f32x4;
use std::simd::num::SimdFloat; // for reduce_sum

fn sum_simd(values: &[f32]) -> f32 {
    let mut total = f32x4::splat(0.0);

    // Process 4 elements at a time; chunks_exact never yields a
    // short chunk, so the load can't go out of bounds
    let chunks = values.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        total += f32x4::from_slice(chunk);
    }

    // Horizontal sum: add all lanes, then the leftover tail elements
    total.reduce_sum() + tail.iter().sum::<f32>()
}

Four additions become one. On AVX-512 hardware, you could do 16 at once.

When to Use It

Honest answer: probably not often. Here's why:

  1. Auto-vectorization is good. For most loops, the compiler does the right thing. Adding explicit SIMD often doesn't change performance.

  2. It's a readability trade-off. The readable version usually wins until you've profiled and proven the hot path needs help.

  3. Lane counts are fixed at compile time. f32x4 compiles and runs everywhere (emulated where the hardware lacks it). But your CPU might support f32x8 or f32x16, and you can't easily adapt the width at runtime.
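One mitigation for that last point, on stable Rust, is runtime feature detection: probe the CPU once and dispatch to a kernel built for wider vectors. A minimal sketch of the dispatch pattern — the AVX2 branch is a placeholder (both paths use the scalar sum here), not a real kernel:

```rust
fn sum_dispatch(values: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    fn best_path(values: &[f32]) -> f32 {
        if is_x86_feature_detected!("avx2") {
            // A hand-written wide kernel (std::arch intrinsics, or a
            // wider Simd type) would go here; scalar sum stands in.
            return values.iter().sum();
        }
        values.iter().sum()
    }

    #[cfg(not(target_arch = "x86_64"))]
    fn best_path(values: &[f32]) -> f32 {
        values.iter().sum()
    }

    best_path(values)
}
```

The detection cost is a cached flag check, so it's negligible next to any loop worth vectorizing.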

So when does it make sense?

A Real Example: Grayscale Conversion

Here's something I needed for rust-sketch — converting RGB pixels to grayscale:

// Naive: one pixel at a time
fn grayscale_naive(pixels: &mut [u8]) {
    for i in (0..pixels.len()).step_by(3) {
        let r = pixels[i] as f32;
        let g = pixels[i + 1] as f32;
        let b = pixels[i + 2] as f32;
        let gray = (0.299 * r + 0.587 * g + 0.114 * b) as u8;
        pixels[i] = gray;
        pixels[i + 1] = gray;
        pixels[i + 2] = gray;
    }
}

// SIMD: process 4 pixels (12 bytes) per iteration
use std::simd::u8x16;

fn grayscale_simd(pixels: &mut [u8]) {
    let mut i = 0;
    // Each step loads 16 bytes but advances only 12 (4 RGB pixels),
    // so the loop must stop while a full 16-byte load stays in bounds
    while i + 16 <= pixels.len() {
        let _rgb = u8x16::from_slice(&pixels[i..i + 16]);

        // Bytes are interleaved R,G,B,R,G,B,...: R sits in lanes
        // 0,3,6,9, G in 1,4,7,10, B in 2,5,8,11. De-interleaving
        // them takes shuffle operations - showing the concept only
        // ...

        i += 12;
    }
    // Handle the remaining pixels with the scalar loop
}

The real code is more involved (extracting every 3rd byte from a SIMD vector requires shuffle operations). But the principle is clear: instead of three scalar multiplications per pixel, you're doing three vector multiplications that cover four pixels at once.
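For comparison, here is a stable-Rust version that needs no SIMD types at all: `chunks_exact_mut(3)` gives the optimizer a fixed-stride, bounds-check-free loop and handles the tail by construction. A sketch with an illustrative name, not the actual rust-sketch code:

```rust
fn grayscale_chunks(pixels: &mut [u8]) {
    // chunks_exact_mut(3) yields only full 3-byte pixels, so the
    // indexing inside the loop can never go out of bounds; a trailing
    // partial pixel (invalid RGB data anyway) is simply left alone.
    for px in pixels.chunks_exact_mut(3) {
        let gray = (0.299 * px[0] as f32
            + 0.587 * px[1] as f32
            + 0.114 * px[2] as f32) as u8;
        px[0] = gray;
        px[1] = gray;
        px[2] = gray;
    }
}
```

This is also a useful baseline to benchmark against: if an explicit SIMD version doesn't beat it, the extra complexity isn't paying for itself.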

The Bigger Picture

SIMD is one tool in a larger toolbox:

  1. First, measure. Profile before optimizing. cargo bench is your friend.
  2. Then, check auto-vectorization. Build with RUSTFLAGS="-C target-cpu=native" (or the specific target features your deployment hardware supports) so the compiler can actually use them. Look at the assembly with cargo-show-asm.
  3. Then, consider explicit SIMD. If the compiler isn't vectorizing and you've proven it's a bottleneck.
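For step 2, the compiler only emits instructions for the features it's told to target; one way to opt in project-wide (an assumption about your setup — adjust for your project) is a `.cargo/config.toml` entry:

```toml
# .cargo/config.toml - compile for the host CPU's full feature set.
# The default x86-64 baseline omits AVX/AVX2/AVX-512, so many loops
# won't auto-vectorize as widely as the hardware allows.
[build]
rustflags = ["-C", "target-cpu=native"]
```

Don't ship `target-cpu=native` binaries to machines older than the build host; for distributed builds, pin explicit features instead (e.g. `-C target-feature=+avx2`).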

Remember: the goal isn't SIMD. The goal is performance. And readable, idiomatic Rust that happens to be fast usually beats clever vectorization that nobody can maintain.

What's Next

This is part of a thread on Data-Oriented Design in Rust.

Coming next: bytemuck for zero-cost casting, and how to build truly contiguous data layouts that the compiler loves.


SIMD is fast. But the fastest code is code you don't need to run. That's the real lesson from DOD: arrange your data right, and the hardware does the rest.