What if you could build a language model from scratch — not just use one, but actually understand how the whole thing works under the hood?
That's exactly what I set out to do. I took Andrej Karpathy's nanoGPT, the legendary "clean, readable" implementation of GPT, and rebuilt it in Rust using the Burn deep learning framework.
Here's what I learned.
Why Rust for Deep Learning?
You might be thinking: "Rust? For neural networks? Isn't that what Python is for?"
It's a fair question. Python dominates ML for good reason — PyTorch, TensorFlow, JAX. The ecosystem is massive. But Rust brings something different to the table:
- Fearless refactoring — The type system catches bugs at compile time. No runtime surprises when you rename a layer.
- Zero-cost abstractions — Burn compiles down to optimized native code. No Python interpreter overhead.
- Cross-platform by default — The same code runs on CPU, CUDA, and Metal via WGPU. No vendor lock-in.
- Memory safety without garbage collection — No unexpected pauses from GC. Real-time inference is viable.
The creator of Burn framed it as solving the "impossible triangle" of accelerated computing: you can pick only two of performance, portability, and flexibility. The Python stack gives you performance plus flexibility but sacrifices portability (CUDA lock-in). Burn aims to deliver all three.
The NanoGPT Challenge
nanoGPT is famously minimal — about 300 lines of Python for the core model. It's the perfect starting point because:
- It's readable. You can actually understand what's happening.
- It's complete. It trains on Shakespeare and produces coherent text.
- It's iconic. Every ML practitioner has studied it.
Porting it to Rust meant translating:
- The transformer architecture (multi-head attention, feed-forward networks, positional embeddings)
- The training loop (backpropagation, optimizer updates, loss computation)
- The tokenization (Byte Pair Encoding)
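The tokenization piece translates naturally: BPE repeatedly counts adjacent token pairs and merges the most frequent one into a new token. Here is a minimal sketch of a single merge step in plain Rust (standard library only; this is an illustration of the algorithm, not the actual port's tokenizer):

```rust
use std::collections::HashMap;

/// One merge step of byte pair encoding: count adjacent token pairs,
/// then replace every occurrence of the most frequent pair with `new_id`.
fn bpe_merge_step(tokens: &[u32], new_id: u32) -> (Vec<u32>, Option<(u32, u32)>) {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    let best = match counts.into_iter().max_by_key(|&(_, n)| n) {
        Some((pair, n)) if n > 1 => pair,
        _ => return (tokens.to_vec(), None), // nothing worth merging
    };
    // Rewrite the sequence, merging non-overlapping occurrences left to right.
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == best {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    (out, Some(best))
}

fn main() {
    // The classic BPE example string, as raw bytes.
    let tokens: Vec<u32> = "aaabdaaabac".bytes().map(u32::from).collect();
    // 256 is the first id past the raw byte range.
    let (merged, pair) = bpe_merge_step(&tokens, 256);
    println!("merged pair {:?} -> {:?}", pair, merged);
}
```

A real tokenizer runs this step in a loop, recording each merge in order so the same merges can be replayed at encode time.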
What Burn Brings to the Table
Burn handles the hard parts:
```rust
// Defining a transformer block in Burn. `Module` is derived on the
// struct itself; `forward` is an ordinary method, generic over the
// backend. Tensor ops take ownership, so residual inputs are cloned.
impl<B: Backend> TransformerBlock<B> {
    fn forward(&self, x: Tensor<B, 3>) -> Tensor<B, 3> {
        // Pre-norm residual connections, as in GPT-2 / nanoGPT.
        let x = x.clone() + self.attention.forward(self.norm1.forward(x));
        x.clone() + self.ffn.forward(self.norm2.forward(x))
    }
}
```
The Module trait is Burn's core abstraction — every neural network layer implements it. The forward method is your computation graph. Under the hood, Burn handles:
- Autodiff — Automatic differentiation via the Autodiff backend wrapper
- Tensor operations — Matrix multiplication, softmax, layernorm — all optimized
- Backend abstraction — Swap between ndarray (CPU), Candle (GPU), or WGPU (cross-platform)
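The backend swap works because model code is written once, generic over a backend trait, and the concrete device is chosen at the call site. A toy illustration of that pattern in plain Rust (the trait and names here are invented for illustration; Burn's real `Backend` trait covers a full tensor API):

```rust
// A drastically simplified stand-in for a backend trait: each backend
// supplies its own kernel implementations behind a common interface.
trait Backend {
    fn name() -> &'static str;
    /// Multiply two n x n row-major matrices.
    fn matmul(a: &[f32], b: &[f32], n: usize) -> Vec<f32>;
}

struct CpuBackend;

impl Backend for CpuBackend {
    fn name() -> &'static str {
        "cpu"
    }
    fn matmul(a: &[f32], b: &[f32], n: usize) -> Vec<f32> {
        let mut out = vec![0.0; n * n];
        for i in 0..n {
            for k in 0..n {
                for j in 0..n {
                    out[i * n + j] += a[i * n + k] * b[k * n + j];
                }
            }
        }
        out
    }
}

// "Model" code is generic over the backend: it never names a device,
// so a GPU backend could be substituted without touching this function.
fn apply_twice<B: Backend>(x: &[f32], w: &[f32], n: usize) -> Vec<f32> {
    let h = B::matmul(x, w, n);
    B::matmul(&h, w, n)
}

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0]; // 2x2 row-major
    let id = vec![1.0, 0.0, 0.0, 1.0]; // identity matrix
    let y = apply_twice::<CpuBackend>(&x, &id, 2);
    println!("{} -> {:?}", CpuBackend::name(), y);
}
```

Because the backend is a type parameter, the choice is resolved at compile time: there is no dynamic dispatch in the inner loops, which is part of how Burn keeps the abstraction zero-cost.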
The Good, The Hard, and The Unexpected
What worked well
- Static typing caught bugs early — Mismatched tensor shapes caught at compile time, not runtime.
- The borrow checker forced clarity — No accidental mutations. The data flow was explicit.
- Performance out of the box — Burn's compiled runtime was competitive with PyTorch.
What was harder than expected
- Debugging tensor shapes — When things don't match, the error messages can be cryptic.
- Learning Burn's API — It's younger than PyTorch. Documentation is solid but fewer tutorials.
- Optimizer implementations — AdamW in Rust required careful porting from the Python reference.
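The AdamW update itself is small once written out: exponential moving averages of the gradient and its square, bias correction, and decoupled weight decay applied directly to the parameter. Here is a single-parameter sketch in plain Rust (hyperparameter defaults follow the AdamW paper; this illustrates the math, not Burn's optimizer API):

```rust
/// Minimal AdamW state for one scalar parameter
/// (decoupled weight decay, per Loshchilov & Hutter).
struct AdamW {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
    m: f32, // first moment (mean of gradients)
    v: f32, // second moment (mean of squared gradients)
    t: u32, // step count, for bias correction
}

impl AdamW {
    fn new(lr: f32) -> Self {
        Self { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, weight_decay: 0.01, m: 0.0, v: 0.0, t: 0 }
    }

    fn step(&mut self, param: f32, grad: f32) -> f32 {
        self.t += 1;
        // Exponential moving averages of the gradient and its square.
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad;
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * grad * grad;
        // Bias correction for the zero-initialized moments.
        let m_hat = self.m / (1.0 - self.beta1.powi(self.t as i32));
        let v_hat = self.v / (1.0 - self.beta2.powi(self.t as i32));
        // Decoupled weight decay: applied to the parameter directly,
        // not folded into the gradient as in Adam with L2 regularization.
        param - self.lr * (m_hat / (v_hat.sqrt() + self.eps) + self.weight_decay * param)
    }
}

fn main() {
    let mut opt = AdamW::new(0.1);
    let mut p = 1.0;
    for _ in 0..3 {
        p = opt.step(p, 2.0);
    }
    println!("param after 3 steps: {p}");
}
```

The subtlety in porting is less the formula than the bookkeeping: the moments and step count are per-parameter state that must survive across iterations, which the borrow checker makes you model explicitly.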
The unexpected insight
The real value wasn't the model itself — it was understanding. Reading through nanoGPT's Python, I could follow the logic. But implementing it in Rust forced me to understand every operation at a deeper level. The borrow checker doesn't care about your high-level intentions. You have to be precise.
Why This Matters
We're entering an era where inference matters as much as training. The ability to run language models efficiently on edge devices, in browsers, or in constrained environments — that's where Rust shines.
Projects like Burn, Candle, and Ruff (a Python linter written in Rust) are showing that Rust isn't just a systems language. It's becoming an infrastructure layer for numerical computing and developer tooling.
And for someone learning Rust like me? Building a GPT from scratch is the ultimate learning project. It's hard. It's frustrating. And when it works, you understand not just the model — you understand the platform.
This post was inspired by the Burn framework and various nanoGPT Rust ports in the ecosystem. The goal wasn't to beat PyTorch — it was to learn.