When Andrej Karpathy releases a new GPT implementation, the ML community pays attention. When Rust developers start porting those implementations, something more interesting happens: a debate about whether Rust's safety guarantees belong in ML at all.
Over the past year, three significant ports have emerged, each taking a different approach to translating Python ML code into Rust. Together, they tell a story about what Rust brings to machine learning — and what it struggles with.
The Original: nanoGPT
Karpathy's nanoGPT is the baseline. Originally a character-level GPT model, it became famous for being readable — under 1000 lines of Python that actually trains. The code is pedagogical: you can read it, understand the attention mechanism, see exactly where the matrix multiplications happen.
The project supports BPE (Byte Pair Encoding) tokenization when you need it, and the training loop is deliberately minimal. It's designed to be modified, extended, understood.
This is what Rust developers wanted to port.
rust-microgpt: The Unsafe Speed Play
The first serious port came from mplekh/rust-microgpt, and it's the most opinionated of the three.
Here's what makes it interesting: it doesn't use any ML framework. No tch, no candle, no burn. Just pure Rust with a custom autograd engine.
The approach is eye-opening:
- Tape-based autodiff: Forward pass records operations to a "tape" (contiguous memory block). Backward pass traverses in reverse, computing gradients.
- Unsafe raw pointers: They bypass Rust's bounds checking in the hot backpropagation loop using `unsafe` and raw pointer arithmetic.
- Pre-allocated memory arena: No heap allocations during training — everything comes from a pre-allocated buffer.
- Custom MT19937 RNG: A full implementation of Mersenne Twister that replicates Python's `random` module exactly, because reproducibility matters.
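The tape idea in the first bullet can be sketched in a few dozen lines of safe Rust. This is an illustrative toy, not rust-microgpt's actual engine: values and ops live in parallel vectors (the "tape"), the forward pass appends entries, and the backward pass walks them in reverse accumulating gradients.

```rust
// Toy tape-based reverse-mode autodiff (illustrative sketch, scalar-only).
#[derive(Clone, Copy)]
enum Op {
    Add(usize, usize), // tape indices of the two inputs
    Mul(usize, usize),
    Input,
}

struct Tape {
    values: Vec<f64>,
    ops: Vec<Op>,
}

impl Tape {
    fn new() -> Self {
        Tape { values: Vec::new(), ops: Vec::new() }
    }
    fn push(&mut self, v: f64, op: Op) -> usize {
        self.values.push(v);
        self.ops.push(op);
        self.values.len() - 1
    }
    fn input(&mut self, v: f64) -> usize {
        self.push(v, Op::Input)
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, Op::Add(a, b))
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] * self.values[b];
        self.push(v, Op::Mul(a, b))
    }
    // Backward pass: traverse the tape in reverse, applying the chain rule.
    fn backward(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.values.len()];
        grads[output] = 1.0;
        for i in (0..self.ops.len()).rev() {
            let g = grads[i];
            match self.ops[i] {
                Op::Add(a, b) => {
                    grads[a] += g;
                    grads[b] += g;
                }
                Op::Mul(a, b) => {
                    grads[a] += g * self.values[b];
                    grads[b] += g * self.values[a];
                }
                Op::Input => {}
            }
        }
        grads
    }
}

fn main() {
    // f(x, y) = x * y + x  =>  df/dx = y + 1, df/dy = x
    let mut tape = Tape::new();
    let x = tape.input(3.0);
    let y = tape.input(4.0);
    let xy = tape.mul(x, y);
    let out = tape.add(xy, x);
    let grads = tape.backward(out);
    println!("df/dx = {}, df/dy = {}", grads[x], grads[y]); // prints 5 and 3
}
```

The contiguous-vector layout is the point: no pointer-chasing graph, just two flat arrays you scan forward once and backward once.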
The result: "matching C++ performance" according to the README. That's a bold claim, but the technique is sound. When you're in the inner loop of backpropagation, bounds checking is overhead you don't need.
This is the Rust ML community's answer to a hard question: do you want safety, or do you want performance? rust-microgpt chooses performance, with an explicit unsafe boundary around the gradient computation.
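A hedged sketch of what that boundary can look like (illustrative, not rust-microgpt's code): the length invariant is asserted once at the function boundary, and `get_unchecked` skips the per-index bounds checks inside the loop.

```rust
// Gradient accumulation with bounds checks on every index.
fn axpy_checked(grad: &mut [f64], upstream: &[f64], scale: f64) {
    for i in 0..grad.len() {
        grad[i] += scale * upstream[i];
    }
}

// Same loop, with the invariant checked once up front instead of per index.
fn axpy_unchecked(grad: &mut [f64], upstream: &[f64], scale: f64) {
    assert_eq!(grad.len(), upstream.len());
    for i in 0..grad.len() {
        // SAFETY: i < len for both slices, established by the assert above.
        unsafe {
            *grad.get_unchecked_mut(i) += scale * upstream.get_unchecked(i);
        }
    }
}

fn main() {
    let mut g = vec![1.0, 2.0, 3.0];
    let up = vec![0.5, 0.5, 0.5];
    axpy_unchecked(&mut g, &up, 2.0);
    println!("{:?}", g); // [2.0, 3.0, 4.0]
}
```

Whether this actually wins depends on the workload; the optimizer can often elide the checks in simple loops like this one, which is why the technique belongs in measured hot paths, not everywhere.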
What About BPE?
Character-level models are great for learning, bad for practical text. The real question is whether anyone has built BPE tokenization in pure Rust.
The short answer: not in these ports. Most Rust GPT implementations either:
- Use Python tokenizers via `pyo3` bindings
- Stick with character-level for simplicity
- Use an external crate like `tokenizers` (HuggingFace's own Rust library, which is what their Python package binds to)
This is a gap in the ecosystem. BPE in pure Rust is still unsolved in a satisfying way.
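For a sense of why pure-Rust BPE is tractable even if no port has shipped it, here is a toy sketch of one training step (illustrative only, nothing like a production tokenizer): count adjacent token pairs, then merge the most frequent pair into a new token id.

```rust
use std::collections::HashMap;

// Count adjacent pairs and return the most frequent one, if any.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|&(_, c)| c).map(|(p, _)| p)
}

// Replace every occurrence of `pair` with the fresh token `new_id`.
fn merge(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Bytes of "aaabdaaabac" -- a classic BPE walkthrough string.
    let mut tokens: Vec<u32> = "aaabdaaabac".bytes().map(|b| b as u32).collect();
    let mut next_id = 256; // raw byte values occupy 0..=255
    for _ in 0..3 {
        if let Some(pair) = most_frequent_pair(&tokens) {
            tokens = merge(&tokens, pair, next_id);
            next_id += 1;
        }
    }
    println!("{} tokens after 3 merges", tokens.len()); // 5 tokens after 3 merges
}
```

The hard parts a real tokenizer adds on top — GPT-2's regex pre-splitting, byte-to-unicode mapping, fast vocabulary lookup — are exactly where the ecosystem gap lives.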
nanochat: The Full Stack
Karpathy's latest (October 2025) is nanochat — and it's a bigger leap. Where nanoGPT was pretraining-only, nanochat adds:
- Fine-tuning capability
- Chat/inference interface
- Single-GPU training harness
- Code-execution integration (Python snippets emitted between special tokens and run as a tool)
The "$100 LLM" concept — train your own ChatGPT on consumer hardware — made it viral. No Rust port exists yet, but it's the obvious next target.
What These Ports Teach Us
Rust can do ML without ML frameworks. The rust-microgpt autograd engine is genuinely impressive. It's not PyTorch — it's roughly 500 lines of code that implement differentiation from scratch. That's the "from first principles" ethos Rust developers love.
The safety/speed tradeoff is real. rust-microgpt uses unsafe not because the developers are reckless, but because they made a deliberate engineering choice: the autograd kernel is a well-defined boundary. Everything outside it is safe. Inside, they needed C++ speed.
Ecosystem gaps matter. BPE tokenization, pretrained weight loading, dataset utilities — these are unsexy but necessary. A beautiful autograd engine doesn't matter if you can't easily tokenize your data.
The Deeper Question
What's the point of building GPT in Rust when PyTorch exists?
The same point as building anything in Rust: control, predictability, deployment simplicity. No Python runtime. No GIL. Single binary deployment. Memory safety guarantees even in the inference path.
But there's a subtler reason too. When you build a neural network from scratch — in any language — you understand it differently. The Rust implementations force you to be explicit about memory layout, allocation strategy, numerical stability. Python hides these details. Rust makes them visible.
That's the real value: not "Rust GPT that's faster than PyTorch," but "Rust GPT that teaches you things PyTorch hid."
What's Next
The obvious next step is nanochat in Rust. Full-stack training + inference + chat interface. Someone will do it. The question is whether they'll use unsafe like rust-microgpt, or take the slower-but-safer path.
Either way, the pattern is clear: Python invents, Rust translates, the translation teaches.
This post is part of my ongoing exploration of Rust in machine learning. Related: Running Local LLMs in Rust, The Rise of Numr: Rust's Answer to NumPy?.