If you've ever tried to speed up Python by rewriting hot functions in Rust, you might have gotten a surprise: just adding PyO3 bindings often makes things slower. The overhead of converting between Python objects and Rust data structures eats up all your gains.
But it doesn't have to be that way. Data-oriented design, combined with smart Python-Rust interop, can actually beat NumPy while keeping a Python-native API.
## The Naive Approach Fails
The obvious way to speed up Python with Rust is to take your hot Python functions and rewrite them:
```rust
use pyo3::prelude::*;
use std::collections::HashMap;

#[pyfunction]
fn add_scalar(d: HashMap<String, f64>, value: f64) -> HashMap<String, f64> {
    d.into_iter().map(|(k, v)| (k, v + value)).collect()
}
```
This should be faster. Rust is fast, right?
The benchmarks say otherwise. This approach runs 3-5x slower than native Python. Every function call requires converting data from Python to Rust and back. All that marshalling overhead adds up fast.
The lesson: just wrapping functions isn't enough. You need to rethink the whole data flow.
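To see why the round trips dominate, here's a pure-Python sketch (hypothetical names, not the PyO3 code above) where `convert` stands in for the data marshalling at the language boundary:

```python
def convert(d):
    # Stand-in for crossing the language boundary: a full copy of the data.
    return dict(d)

def add_scalar_naive(d, value):
    # Naive wrapper: convert in, operate, convert out -- paid on every call.
    rust_side = convert(d)
    result = {k: v + value for k, v in rust_side.items()}
    return convert(result)

def add_scalar_batched(scalars, d):
    # Convert once, run all the operations, convert back once.
    rust_side = convert(d)
    for value in scalars:
        rust_side = {k: v + value for k, v in rust_side.items()}
    return convert(rust_side)

data = {"a": 1.0, "b": 2.0}
out = data
for v in (1.0, 2.0, 3.0):
    out = add_scalar_naive(out, v)  # three calls, six conversions
assert out == add_scalar_batched((1.0, 2.0, 3.0), data)  # two conversions total
```

Both paths compute the same result; the batched one pays the conversion cost twice regardless of how many operations run in between.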
## The Key Insight: Convert Once, Operate Many Times
The fix is simple but powerful: instead of converting data for every operation, convert once at the start, run many operations on the Rust side, then convert back once at the end.
You do this by creating a custom Python class in Rust:
```rust
use std::sync::Arc;

#[pyclass]
struct RedDict {
    values: Arc<HashMap<String, f64>>,
}

#[pymethods]
impl RedDict {
    #[new]
    fn new(dict: HashMap<String, f64>) -> Self {
        Self { values: Arc::new(dict) }
    }

    fn add(&self, other: &Self, fill: f64) -> Self {
        let values = self.values.iter()
            .map(|(k, v)| (k.clone(), v + other.values.get(k).unwrap_or(&fill)))
            .collect();
        Self { values: Arc::new(values) }
    }
}
```
Now you get a chainable Python API (`add_scalar` and `multiply` here stand for further methods defined the same way as `add`):

```python
rd = RedDict(py_dict)
rd2 = rd.add_scalar(2).multiply(10)
```
And the performance jumps. One benchmark showed a 4x improvement over naive PyO3 — from 1.066 seconds down to 0.240 seconds for 100,000 operations.
## The Data-Oriented Revolution
But we can go further. HashMaps are great for random access, but for element-wise operations, vectors win: CPUs prefetch sequential memory and can apply SIMD to contiguous data.
Here's where data-oriented design comes in: instead of a HashMap, store keys in one vector and values in another:
```rust
#[pyclass]
struct RedDict {
    index: Arc<HashMap<String, usize>>, // key -> position
    values: Arc<Vec<f64>>,              // packed values
}
```
Now operations become simple vector operations. No lookups, just sequential access.
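As a rough illustration, here is the same layout sketched in pure Python (class and method names are made up, not Redbear's API): keys map to positions once, and every operation is a single pass over a flat array of values.

```python
class SoADict:
    """Struct-of-arrays dict: keys in an index, values packed in a flat list."""

    def __init__(self, d):
        self.index = {k: i for i, k in enumerate(d)}  # key -> position
        self.values = list(d.values())                # packed values

    def add_scalar(self, x):
        out = SoADict.__new__(SoADict)
        out.index = self.index                     # share the index, copy nothing
        out.values = [v + x for v in self.values]  # one sequential pass
        return out

    def to_dict(self):
        return {k: self.values[i] for k, i in self.index.items()}

sd = SoADict({"a": 1.0, "b": 2.0}).add_scalar(10.0)
assert sd.to_dict() == {"a": 11.0, "b": 12.0}
```

In Rust the payoff is much larger than in this sketch, because iterating a `Vec<f64>` is cache-friendly and auto-vectorizable, while Python lists still hold boxed objects.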
The performance gains are dramatic:
| Library | 100K × 5 ops (10 items) | 100K × 5 ops (1000 items) |
|---------|-------------------------|---------------------------|
| NumPy   | 0.285s                  | 0.371s                    |
| Redbear | 0.065s                  | 0.160s                    |
Redbear beats NumPy. On small data, it's more than 4x faster.
## The Secret Weapon: Shared Indexes
The final optimization is elegant. When you chain operations like rd.add(other).multiply(10), each derived dictionary shares its parent's index, so both operands often hold the very same index. When they do, positions line up and you can skip lookups entirely:
```rust
fn add(&self, other: &Self, fill: f64) -> Self {
    let mut new = self.clone(); // assumes RedDict derives Clone (cheap: two Arc bumps)
    let new_vals = Arc::make_mut(&mut new.values);
    if Arc::ptr_eq(&new.index, &other.index) {
        // Fast path: both sides share one index, so positions line up and
        // no lookups are needed. Pointer identity is O(1); a deep `==`
        // comparison would cost as much as the lookups it avoids.
        for (lhs, rhs) in new_vals.iter_mut().zip(other.values.iter()) {
            *lhs += rhs;
        }
    } else {
        // Slow path: resolve each key through the other side's index.
        for (key, &i) in self.index.iter() {
            let rhs = other.index.get(key).map(|&j| other.values[j]).unwrap_or(fill);
            new_vals[i] += rhs;
        }
    }
    new
}
```
This is the kind of optimization that only works when you control the data layout. NumPy can't do this because it doesn't maintain index semantics.
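To make the fast/slow split concrete, here's a pure-Python mirror of the idea (illustrative names, not Redbear's API), where `self.index is other.index` plays the role of the index match check:

```python
class SharedIndexDict:
    def __init__(self, d):
        self.index = {k: i for i, k in enumerate(d)}  # key -> position
        self.values = list(d.values())                # packed values

    def add(self, other, fill=0.0):
        out = SharedIndexDict.__new__(SharedIndexDict)
        out.index = self.index  # result reuses our index, so chains stay fast
        if self.index is other.index:
            # Fast path: same index object, positions line up, no lookups.
            out.values = [a + b for a, b in zip(self.values, other.values)]
        else:
            # Slow path: resolve every key through the other side's index.
            out.values = list(self.values)
            for key, i in self.index.items():
                j = other.index.get(key)
                out.values[i] += other.values[j] if j is not None else fill
        return out

a = SharedIndexDict({"x": 1.0, "y": 2.0})
b = a.add(a)                                     # same index object: fast path
c = b.add(SharedIndexDict({"y": 1.0, "z": 5.0}))  # different index: slow path
assert b.values == [2.0, 4.0]
assert c.values == [2.0, 5.0]
```

Note how `b` inherits `a`'s index: as long as you keep combining derived instances, every step stays on the fast path.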
## What This Means for You
If you need NumPy-level performance but with Python dict semantics, you now have a path:
- Don't just wrap functions — create custom data structures
- Convert once — batch your operations on the Rust side
- Think in vectors — data-oriented design beats object-oriented for numerical work
- Track your indexes — when operations share the same keys, you can skip work
The tradeoff is real: Redbear only works with floats, and you need to reuse derived instances for best performance. But for the right use case, the gains are worth it.
The bigger lesson is that Rust-Python interop isn't just about rewriting slow functions. It's about rethinking where your data lives and how it moves. NumPy won because it moved the data once and operated many times. Now you can too — with a Python-native API.
This post was inspired by research from This Week in Rust issue 641 and the Redbear library by Artem Chernyak.