What if you could build a language model from scratch — not just use one, but actually understand how the whole thing works under the hood?

That's exactly what I set out to do. I took Andrej Karpathy's nanoGPT, the legendary "clean, readable" implementation of GPT, and rebuilt it in Rust using the Burn deep learning framework.

Here's what I learned.

Why Rust for Deep Learning?

You might be thinking: "Rust? For neural networks? Isn't that what Python is for?"

It's a fair question. Python dominates ML for good reason — PyTorch, TensorFlow, JAX. The ecosystem is massive. But Rust brings something different to the table:

The creator of Burn frames it as solving the "impossible triangle" of accelerated computing: of performance, portability, and flexibility, you can usually pick only two. Python with PyTorch gives you performance and flexibility but sacrifices portability (CUDA lock-in). Burn aims to deliver all three.

The NanoGPT Challenge

nanoGPT is famously minimal — about 300 lines of Python for the core model. It's the perfect starting point because:

  1. It's readable. You can actually understand what's happening.
  2. It's complete. It trains on Shakespeare and produces coherent text.
  3. It's iconic. Every ML practitioner has studied it.

Porting it to Rust meant translating the core pieces one by one: causal self-attention, the feed-forward and LayerNorm layers, the training loop, and the Shakespeare data pipeline.

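To make that concrete, nanoGPT's Shakespeare setup builds a character-level vocabulary before any model code runs. Here is a std-only Rust sketch of that encode/decode step (function names are mine, not from any particular port):

```rust
use std::collections::HashMap;

/// Build a character-level vocabulary from the training corpus,
/// mirroring what nanoGPT's Shakespeare prepare step does in Python.
fn build_vocab(text: &str) -> (Vec<char>, HashMap<char, usize>) {
    let mut chars: Vec<char> = text.chars().collect();
    chars.sort_unstable();
    chars.dedup();
    let stoi = chars.iter().enumerate().map(|(i, c)| (*c, i)).collect();
    (chars, stoi)
}

/// Map text to token ids using the string-to-index table.
fn encode(text: &str, stoi: &HashMap<char, usize>) -> Vec<usize> {
    text.chars().map(|c| stoi[&c]).collect()
}

/// Map token ids back to text using the index-to-string table.
fn decode(ids: &[usize], itos: &[char]) -> String {
    ids.iter().map(|&i| itos[i]).collect()
}

fn main() {
    let corpus = "to be or not to be";
    let (itos, stoi) = build_vocab(corpus);
    let ids = encode("to be", &stoi);
    // Round-tripping through the vocabulary is lossless.
    assert_eq!(decode(&ids, &itos), "to be");
    println!("vocab size: {}", itos.len());
}
```

The pleasant surprise is that this part translates almost line for line; the ownership questions only start once tensors enter the picture.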
What Burn Brings to the Table

Burn handles the hard parts:

// A pre-norm transformer block in Burn. `Module` is derived on the
// struct rather than implemented by hand; `forward` is an ordinary
// inherent method.
impl<B: Backend> TransformerBlock<B> {
    fn forward(&self, x: Tensor<B, 3>) -> Tensor<B, 3> {
        // Residual connections: tensor ops take operands by value,
        // so the skip branch clones before each sub-layer.
        let x = x.clone() + self.attention.forward(self.norm1.forward(x));
        x.clone() + self.ffn.forward(self.norm2.forward(x))
    }
}

The Module trait is Burn's core abstraction: every neural network layer derives it with #[derive(Module)], which registers the layer's parameters automatically. The forward method defines your computation. Under the hood, Burn handles parameter tracking, automatic differentiation (by wrapping your backend in Autodiff), and dispatch to whichever backend you compile against.
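Picking a compute target is a type-level choice. A sketch of what that looks like, assuming the relevant wgpu and autodiff features are enabled in Cargo.toml:

```rust
// Inference runs on the plain backend; wrapping it in Autodiff
// enables gradient tracking for training.
type MyBackend = burn::backend::Wgpu;
type MyAutodiffBackend = burn::backend::Autodiff<MyBackend>;
```

Switching from GPU to CPU is then a one-line change to the type alias rather than a rewrite.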

The Good, The Hard, and The Unexpected

What worked well

What was harder than expected

The unexpected insight

The real value wasn't the model itself — it was understanding. Reading through nanoGPT's Python, I could follow the logic. But implementing it in Rust forced me to understand every operation at a deeper level. The borrow checker doesn't care about your high-level intentions. You have to be precise.
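A concrete instance of that precision: where Python writes x = x + attn(ln(x)) and moves on, Rust makes you decide which values are moved and which are cloned, because operators that take their operands by value consume them. A std-only toy (the tensor stand-in and sublayer are mine, purely illustrative) shows the same discipline:

```rust
#[derive(Clone, Debug, PartialEq)]
struct T(Vec<f32>); // toy stand-in for a tensor

impl std::ops::Add for T {
    type Output = T;
    fn add(self, rhs: T) -> T {
        // Element-wise add; both operands are consumed.
        T(self.0.iter().zip(&rhs.0).map(|(a, b)| a + b).collect())
    }
}

/// Stand-in for an attention or FFN sub-layer: takes its input by value.
fn sublayer(x: T) -> T {
    T(x.0.iter().map(|v| v * 2.0).collect())
}

fn main() {
    let x = T(vec![1.0, 2.0]);
    // `x + sublayer(x)` would not compile: `sublayer(x)` moves `x`,
    // so the residual branch must clone it explicitly.
    let y = x.clone() + sublayer(x);
    assert_eq!(y, T(vec![3.0, 6.0]));
}
```

The borrow checker turns an implicit aliasing question into an explicit one, which is exactly where the deeper understanding came from.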

Why This Matters

We're entering an era where inference matters as much as training. The ability to run language models efficiently on edge devices, in browsers, or in constrained environments — that's where Rust shines.
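Sampling is one of those inference-time pieces, and it needs nothing beyond the standard library. A sketch of the temperature-scaled softmax that a generate loop applies to the model's logits (the same math nanoGPT uses; the function name is mine):

```rust
/// Turn raw logits into a probability distribution. Lower temperature
/// sharpens the distribution toward the top logit; higher flattens it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max first for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.1];
    let cool = softmax_with_temperature(&logits, 0.5);
    let warm = softmax_with_temperature(&logits, 2.0);
    // Lower temperature concentrates probability mass on the top logit.
    assert!(cool[0] > warm[0]);
    println!("cool: {:?}\nwarm: {:?}", cool, warm);
}
```

Compiled to WASM or for an embedded target, this kind of loop is exactly the workload where Rust's lack of runtime overhead pays off.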

Projects like Burn, Candle (Hugging Face's minimalist ML framework in Rust), and Ruff (a Python linter written in Rust) are showing that Rust isn't just a systems language. It's becoming an infrastructure layer for the Python and numerical computing ecosystems.

And for someone learning Rust like me? Building a GPT from scratch is the ultimate learning project. It's hard. It's frustrating. And when it works, you understand not just the model — you understand the platform.


This post was inspired by the Burn framework and various nanoGPT Rust ports in the ecosystem. The goal wasn't to beat PyTorch — it was to learn.