Every time you send a prompt to ChatGPT or Claude, you're making a trade-off. You're trading privacy for convenience, latency for capability, and control for simplicity. But what if you didn't have to?

Running large language models locally has moved from "interesting experiment" to "viable production option" in the past year. The Rust ecosystem has caught up, and today I'm going to show you how to run local LLMs in your own Rust code using llama.cpp bindings.

Why Run Locally?

Before we dive into code, let's talk about why you'd want to do this:

Privacy. Your prompts never leave your machine. No third-party servers, no data retention policies, no unexpected API logs. This matters for enterprise work, medical data, legal documents—anything sensitive.

Latency. A round trip to a hosted API often takes 500ms or more. Local inference can come in under 100ms for smaller models on decent hardware. That changes what you can build.

Cost. API calls add up. Once you've paid for GPU hardware (or are using your existing machine), each additional query costs nothing but electricity. No rate limits, no metered tokens.

Control. Want a specific model? Want to fine-tune? Want to run the same model offline on a laptop in a cabin? Go for it.

The Stack: llama.cpp + Rust

llama.cpp is the gold standard for efficient local LLM inference. Written in C/C++, it supports GPU acceleration (CUDA, Metal, Vulkan), quantization, and a wide range of model formats.

The Rust bindings come in two flavors: low-level FFI bindings that mirror the C API directly, and high-level wrappers that expose a safe, idiomatic Rust interface.

We'll use the high-level API.

Setting It Up

Add the dependency:

[dependencies]
llama-cpp = "0.2"

You'll also need a model. The easiest way to get started is with a quantized GGUF file from Hugging Face. For local inference, look for models in Q4_K_M or Q5_K_S quantization—they're small (3-5GB) but still capable.
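To see where that 3-5GB figure comes from, here's a back-of-envelope sketch (my own approximation, not an official formula): weight size is roughly parameter count times effective bits per weight. The bits-per-weight values below for Q4_K_M and Q5_K_S are ballpark numbers.

```rust
/// Rough estimate of quantized weight size in GB:
/// parameters (in billions) times effective bits per weight, divided by 8.
/// Ignores metadata and runtime overhead, so treat it as a ballpark figure.
fn approx_weight_size_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    // Effective bits per weight are approximate: ~4.5 for Q4_K_M, ~5.5 for Q5_K_S.
    println!("7B @ Q4_K_M ≈ {:.1} GB", approx_weight_size_gb(7.0, 4.5)); // ~3.9 GB
    println!("7B @ Q5_K_S ≈ {:.1} GB", approx_weight_size_gb(7.0, 5.5)); // ~4.8 GB
}
```

That's how a 7B model lands in the 3-5GB range quoted above.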

Your First Local Inference

Here's a complete example that loads a model and generates text:

use llama_cpp::LlamaPipeline;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the pipeline with your model
    let mut pipeline = LlamaPipeline::from_file("./models/phi-2-q4_k_m.gguf")?;
    
    // Create a prompt
    let prompt = "Write a Rust function that reverses a string:";
    
    // Generate! 
    let result = pipeline.generate(
        prompt,
        llama_cpp::GenerationSettings::default()
            .with_max_tokens(200)
            .with_temperature(0.7),
    )?;
    
    println!("{}", result);
    Ok(())
}

That's it. Load a GGUF file, call generate(), get text back.

Making It Interactive

But let's be honest—batch generation isn't the interesting part. The interesting part is building an interactive agent that can:

  1. Load a system prompt (instructions for the model)
  2. Maintain conversation history
  3. Actually use tools

Here's a more realistic structure:

use llama_cpp::{LlamaPipeline, ChatMessage};

struct LocalAgent {
    pipeline: LlamaPipeline,
    conversation: Vec<ChatMessage>,
}

impl LocalAgent {
    fn new(model_path: &str, system_prompt: &str) -> Result<Self, Box<dyn std::error::Error>> {
        let pipeline = LlamaPipeline::from_file(model_path)?;
        
        Ok(Self {
            pipeline,
            // Seed the history with the system prompt so every generation sees it
            conversation: vec![ChatMessage::system(system_prompt)],
        })
    }
    
    fn chat(&mut self, user_message: &str) -> Result<String, Box<dyn std::error::Error>> {
        // Add user message to history
        self.conversation.push(ChatMessage::user(user_message));
        
        // Generate with conversation context
        let response = self.pipeline.chat(
            &self.conversation,
            llama_cpp::GenerationSettings::default()
                .with_max_tokens(500)
                .with_temperature(0.7),
        )?;
        
        // Add assistant response to history
        self.conversation.push(ChatMessage::assistant(&response));
        
        Ok(response)
    }
}

Now you have a stateful chat agent. Add a tool-calling layer on top (parse JSON from the model, execute functions, feed results back) and you've got an agent.
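To make that tool-calling idea concrete, here's a minimal, dependency-free sketch. Everything in it is illustrative: the JSON-ish output format, the extract_field helper, and the reverse/upper tools are stand-ins of my own, and a real implementation would parse with serde_json against a proper tool schema.

```rust
/// Pull the value that follows `"key":"` out of a JSON-ish string.
/// Naive string scanning to stay dependency-free; use serde_json in practice.
fn extract_field(output: &str, key: &str) -> Option<String> {
    let pat = format!("\"{}\":\"", key);
    let start = output.find(&pat)? + pat.len();
    let end = output[start..].find('"')? + start;
    Some(output[start..end].to_string())
}

/// Dispatch a tool call; the tools here are hypothetical examples.
fn run_tool(name: &str, arg: &str) -> String {
    match name {
        "reverse" => arg.chars().rev().collect(),
        "upper" => arg.to_uppercase(),
        _ => format!("unknown tool: {}", name),
    }
}

fn main() {
    // Pretend the model replied with a tool call instead of plain text.
    let model_output = r#"{"tool":"reverse","arg":"hello"}"#;
    if let (Some(tool), Some(arg)) =
        (extract_field(model_output, "tool"), extract_field(model_output, "arg"))
    {
        // In the agent loop, this result would be pushed back into the
        // conversation history for the model's next turn.
        println!("tool result: {}", run_tool(&tool, &arg)); // tool result: olleh
    }
}
```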

The Catch (Because There's Always One)

Running LLMs locally isn't all sunshine:

Hardware requirements. You need RAM. Lots of it. A Q4-quantized 7B model needs roughly 4GB just to load its weights, and the KV cache and runtime overhead add more on top. A GPU helps but isn't strictly required for small models.

Model capability. A 3B parameter model (quantized) is not going to match GPT-4. It's closer to GPT-3.5 or even GPT-3. For complex reasoning, you still need the cloud.

Token speed. Even with Metal acceleration on Apple Silicon, you're looking at 20-40 tokens/second for a 3B model. Cloud APIs are faster for large models.
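At those rates, response length dominates perceived latency. A quick sanity check (simple arithmetic, nothing model-specific):

```rust
/// Seconds to stream a full response at a given generation rate.
fn response_seconds(tokens: f64, tokens_per_sec: f64) -> f64 {
    tokens / tokens_per_sec
}

fn main() {
    // A 500-token answer at 30 tokens/second takes roughly 17 seconds...
    println!("{:.1}s", response_seconds(500.0, 30.0));
    // ...so keeping local responses short matters more than it does in the cloud.
    println!("{:.1}s", response_seconds(100.0, 30.0));
}
```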

Context windows. Most local models max out at 4K-8K context. Some newer ones support 32K, but they eat RAM.
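The RAM cost of long contexts comes mostly from the KV cache, which grows linearly with context length. Here's a sketch of the usual estimate, using dimensions typical of a 7B-class model (32 layers, 32 KV heads, head dimension 128, f16 cache; these are assumed for illustration, and models with grouped-query attention need considerably less).

```rust
/// Approximate KV-cache size in bytes: two tensors (K and V) per layer,
/// each holding n_kv_heads * head_dim values per token.
fn kv_cache_bytes(n_layers: u64, n_kv_heads: u64, head_dim: u64,
                  context_len: u64, bytes_per_value: u64) -> u64 {
    2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
}

fn main() {
    let gib = 1024u64.pow(3);
    // 32 layers, 32 KV heads, head_dim 128, f16 (2 bytes per value).
    println!("4K context:  {} GiB", kv_cache_bytes(32, 32, 128, 4096, 2) / gib);  // 2 GiB
    println!("32K context: {} GiB", kv_cache_bytes(32, 32, 128, 32768, 2) / gib); // 16 GiB
}
```

Going from 4K to 32K context multiplies the cache by eight, which is exactly why long-context local models "eat RAM."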

When It Makes Sense

Local inference shines in specific scenarios: sensitive data that can't leave your machine, offline or air-gapped environments, high-volume simple tasks where per-call API costs would dominate, and latency-sensitive interactive features.

What's Next

The Rust ecosystem for local AI is maturing fast. Beyond llama.cpp, you've got candle (Hugging Face's pure-Rust tensor framework) and mistral.rs (a Rust-native LLM inference engine), both of which can run quantized models locally.

And the models keep getting better. Mistral, Phi, Qwen—all available in GGUF, all runnable locally.


The future isn't "local OR cloud." It's both. The best agents will route simple tasks to local models (fast, cheap, private) and escalate to cloud for complex reasoning. That's the architecture I'm building for ZeroClaw—and now you can build it too.