Every time you send a prompt to ChatGPT or Claude, you're making a trade-off. You're trading privacy for convenience, latency for capability, and control for simplicity. But what if you didn't have to?
Running large language models locally has moved from "interesting experiment" to "viable production option" in the past year. The Rust ecosystem has caught up, and today I'm going to show you how to run local LLMs in your own Rust code using llama.cpp bindings.
Why Run Locally?
Before we dive into code, let's talk about why you'd want to do this:
Privacy. Your prompts never leave your machine. No third-party servers, no data retention policies, no unexpected API logs. This matters for enterprise work, medical data, legal documents, and anything else sensitive.
Latency. A round trip to a hosted API typically takes several hundred milliseconds, often more. Local inference can be under 100ms for smaller models on decent hardware. That changes what you can build.
Cost. API calls add up. Once you've paid for GPU hardware (or are using your existing machine), inference is free. Unlimited queries, no rate limits.
Control. Want a specific model? Want to fine-tune? Want to run the same model offline on a laptop in a cabin? Go for it.
The Stack: llama.cpp + Rust
llama.cpp is the gold standard for efficient local LLM inference. Written in C/C++, it supports GPU acceleration (CUDA, Metal, Vulkan), quantization, and a wide range of model formats.
The Rust bindings come in two flavors:
- llama-cpp-sys: low-level FFI bindings
- llama-cpp: a high-level, safe Rust API
We'll use the high-level API.
Setting It Up
Add the dependency:
[dependencies]
llama-cpp = "0.2"
You'll also need a model. The easiest way to get started is with a quantized GGUF file from Hugging Face. For local inference, look for models in Q4_K_M or Q5_K_S quantization; at the 7B scale they're small (roughly 3-5GB) but still capable.
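Once the file is downloaded, it's worth a quick sanity check before pointing the library at it: every GGUF file begins with the 4-byte ASCII magic "GGUF", so a truncated or mislabeled download can be caught cheaply. A minimal sketch (the path is illustrative):

```rust
use std::fs::File;
use std::io::Read;

/// Returns true if the file at `path` starts with the GGUF magic bytes.
fn is_gguf(path: &str) -> std::io::Result<bool> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(&magic == b"GGUF")
}

fn main() {
    match is_gguf("./models/phi-2-q4_k_m.gguf") {
        Ok(true) => println!("Looks like a valid GGUF file"),
        Ok(false) => println!("Not a GGUF file (truncated download?)"),
        Err(e) => println!("Could not read file: {}", e),
    }
}
```

This only checks the header, not the full file, but it catches the most common failure mode: an interrupted download that saved an HTML error page instead of a model.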
Your First Local Inference
Here's a complete example that loads a model and generates text:
use llama_cpp::LlamaPipeline;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize the pipeline with your model
    let mut pipeline = LlamaPipeline::from_file("./models/phi-2-q4_k_m.gguf")?;

    // Create a prompt
    let prompt = "Write a Rust function that reverses a string:";

    // Generate!
    let result = pipeline.generate(
        prompt,
        llama_cpp::GenerationSettings::default()
            .with_max_tokens(200)
            .with_temperature(0.7),
    )?;

    println!("{}", result);
    Ok(())
}
That's it. Load a GGUF file, call generate(), get text back.
Making It Interactive
But let's be honest: batch generation isn't the interesting part. The interesting part is building an interactive agent that can:
- Load a system prompt (instructions for the model)
- Maintain conversation history
- Actually use tools
Here's a more realistic structure:
use llama_cpp::{LlamaPipeline, ChatMessage};

struct LocalAgent {
    pipeline: LlamaPipeline,
    system_prompt: String,
    conversation: Vec<ChatMessage>,
}

impl LocalAgent {
    fn new(model_path: &str, system_prompt: &str) -> Result<Self, Box<dyn std::error::Error>> {
        let pipeline = LlamaPipeline::from_file(model_path)?;
        Ok(Self {
            pipeline,
            system_prompt: system_prompt.to_string(),
            conversation: vec![],
        })
    }

    fn chat(&mut self, user_message: &str) -> Result<String, Box<dyn std::error::Error>> {
        // Add the user message to history
        self.conversation.push(ChatMessage::user(user_message));

        // Generate with the full conversation context
        let response = self.pipeline.chat(
            &self.conversation,
            llama_cpp::GenerationSettings::default()
                .with_max_tokens(500)
                .with_temperature(0.7),
        )?;

        // Add the assistant response to history
        self.conversation.push(ChatMessage::assistant(&response));
        Ok(response)
    }
}
Now you have a stateful chat agent. Add a tool-calling layer on top (parse JSON from the model, execute functions, feed the results back) and you've got a real agent.
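To make the tool-calling idea concrete, here's a dependency-free sketch of the parse, dispatch, feed-back loop. The wire format, field names, and tools are all invented for illustration; a real implementation would parse with serde_json and dispatch to your actual tool set:

```rust
/// Pull the string value for `key` out of a flat JSON object.
/// Naive by design: assumes `"key": "value"` pairs with no escapes.
fn extract_field(json: &str, key: &str) -> Option<String> {
    let pat = format!("\"{}\"", key);
    let start = json.find(&pat)? + pat.len();
    let rest = &json[start..];
    let open = rest.find('"')? + 1;
    let close = rest[open..].find('"')? + open;
    Some(rest[open..close].to_string())
}

/// Dispatch a parsed tool call to a local function and return the
/// result that would be fed back into the conversation.
fn run_tool(name: &str, arg: &str) -> String {
    match name {
        "reverse" => arg.chars().rev().collect(),
        "upper" => arg.to_uppercase(),
        _ => format!("unknown tool: {}", name),
    }
}

fn main() {
    // Pretend the model emitted this as its response:
    let model_output = r#"{"tool": "reverse", "arg": "hello"}"#;

    let name = extract_field(model_output, "tool").unwrap();
    let arg = extract_field(model_output, "arg").unwrap();
    let result = run_tool(&name, &arg);

    // In the agent loop, `result` goes back into the conversation
    // history so the model can see what its tool call produced.
    println!("{}", result);
}
```

The loop is the whole trick: model emits a structured call, you execute it, and the result becomes the next message. Everything else is error handling.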
The Catch (Because There's Always One)
Running LLMs locally isn't all sunshine:
Hardware requirements. You need RAM. Lots of it. A Q4-quantized 7B model needs roughly 4GB of RAM just to load. For decent speed, add more. A GPU helps but isn't strictly required for small models.
Model capability. A 3B parameter model (quantized) is not going to match GPT-4. It's closer to GPT-3.5 or even GPT-3. For complex reasoning, you still need the cloud.
Token speed. Even with Metal acceleration on Apple Silicon, you're looking at 20-40 tokens/second for a 3B model. Cloud APIs are faster for large models.
Context windows. Most local models max out at 4K-8K context. Some newer ones support 32K, but they eat RAM.
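The RAM figures above follow from simple arithmetic: parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A rough estimator (the 20% overhead and the 4.5 bits/weight figure for Q4_K_M-class quantization are ballpark assumptions, not measurements):

```rust
/// Back-of-envelope RAM estimate for a quantized model, in GB.
fn approx_ram_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    // Weights: parameters * bits per weight / 8 bits per byte.
    let weights_gb = params_billions * bits_per_weight / 8.0;
    // Assumed ~20% headroom for KV cache and buffers.
    weights_gb * 1.2
}

fn main() {
    // A 7B model at ~4.5 bits/weight (Q4_K_M is in this ballpark):
    println!("7B @ Q4: ~{:.1} GB", approx_ram_gb(7.0, 4.5));
    // The same model at 8-bit quantization:
    println!("7B @ Q8: ~{:.1} GB", approx_ram_gb(7.0, 8.0));
}
```

The KV cache grows with context length, so a 32K-context session can add gigabytes on top of this estimate.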
When It Makes Sense
Local inference shines in specific scenarios:
- Tools and agents that need fast, repetitive calls (classifying, extracting, formatting)
- Offline-first applications (air-gapped systems, field work)
- Privacy-critical workloads (healthcare, legal, financial)
- Experimentation and development (iterate without burning API credits)
What's Next
The Rust ecosystem for local AI is maturing fast. Beyond llama.cpp, you've got:
- candle: Hugging Face's minimalist Rust ML framework, with transformer support
- rust-bert: a community Rust port of Hugging Face's Transformers models (still early)
- tiktoken: OpenAI's fast BPE tokenizer
And the models keep getting better. Mistral, Phi, Qwen: all available in GGUF, all runnable locally.
The future isn't "local OR cloud." It's both. The best agents will route simple tasks to local models (fast, cheap, private) and escalate to cloud for complex reasoning. That's the architecture I'm building for ZeroClaw, and now you can build it too.
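That routing idea can be sketched in a few lines. The thresholds and keyword list below are placeholders you would tune; a production router would more likely use a small classifier or a cost model:

```rust
/// Where a request should run.
#[derive(Debug, PartialEq)]
enum Route {
    Local,
    Cloud,
}

/// Cheap heuristic router: keep short, simple requests local and
/// escalate anything long or reasoning-heavy to the cloud.
fn route(prompt: &str) -> Route {
    // Long prompts blow past small local context windows.
    if prompt.len() > 2_000 {
        return Route::Cloud;
    }
    // Keywords that suggest multi-step reasoning go to the big model.
    let hard = ["prove", "analyze", "step by step", "refactor"];
    let lower = prompt.to_lowercase();
    if hard.iter().any(|k| lower.contains(k)) {
        return Route::Cloud;
    }
    // Everything else: fast, cheap, private local inference.
    Route::Local
}

fn main() {
    println!("{:?}", route("Classify this ticket as bug or feature"));
    println!("{:?}", route("Prove that this algorithm terminates"));
}
```

The payoff is that the bulk of agent traffic (classification, extraction, formatting) stays on the local model, and you only pay cloud latency and cost for the requests that genuinely need it.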