You've built a nice agent. It can call tools. It can reason. And then you switch from GPT-4 to a local model—something like Qwen3.5 or Llama 3.3 running via llama.cpp—and everything falls apart.

The agent starts ignoring tools. It calls the wrong tool. It calls the right tool with garbage arguments. Or it returns valid JSON... wrapped in markdown code blocks that your parser chokes on.

This isn't your fault. Local models have fundamentally different failure modes than hosted API models. Let me show you what's actually happening.

## The Three Layers of Tool-Calling Failure

When you hand a local LLM a tool definition, you're trusting it to do three things:

  1. Select the right tool (or decide to use no tool at all)
  2. Generate valid arguments that match your schema
  3. Output the result in a parseable format

API models like GPT-4 and Claude have been fine-tuned extensively on tool-calling data. Most local models haven't been, at least not to the same degree: they're often trained on general chat data, then instruction-tuned on tool use, sometimes poorly.

### Failure Pattern 1: The JSON Smuggler

The most common failure. The model generates valid JSON... then wraps it in markdown:

Here's the tool call:

```json
{"tool": "get_weather", "arguments": {"location": "Boston"}}
```

Your parser expects raw JSON. It gets a string with fences. Parsing fails.

This happens because the model learned to "helpfully" format its output for human readers. It's trying to be helpful. To your parser, it isn't.

### Failure Pattern 2: The Thinking Token Polluter

Models with "thinking modes" or O1-style reasoning have a special failure mode. They generate internal reasoning tokens *before* the tool call, and sometimes those tokens leak into the JSON output:

```json
{"tool": "get_weather", "arguments": {"location": "Let me think about this... Boston has weather..."}}
```

The JSON is technically valid. The content is garbage. Your schema validation passes. Your tool receives nonsense.
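
If you're hitting this, strip the reasoning before parsing and sanity-check argument content afterwards. Here's a minimal sketch; it assumes the model brackets its reasoning in `<think>...</think>` tags (as Qwen-style thinking modes do), and the `looks_like_leaked_reasoning` heuristic is illustrative, not exhaustive:

```rust
use regex::Regex;

// Remove <think>...</think> blocks before handing the text to the JSON parser.
// Assumption: the model delimits its reasoning with these tags; adjust the
// pattern to whatever delimiters your model actually emits.
fn strip_thinking_tokens(response: &str) -> String {
    let re = Regex::new(r"(?s)<think>.*?</think>").unwrap();
    re.replace_all(response, "").trim().to_string()
}

// Cheap heuristic for reasoning that leaked *into* an argument value:
// a city name should not read like a monologue.
fn looks_like_leaked_reasoning(value: &str) -> bool {
    value.len() > 100 || value.contains("Let me think") || value.contains("First, I")
}
```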

### Failure Pattern 3: The Schema Guesser

Your tool expects:

```rust
struct WeatherArgs {
    location: String,
    units: Option<String>,
}
```

The model generates:

{"location": "Boston", "type": "celsius"}

Close, but `type` isn't a valid field. Your parser rejects it. Or worse—it accepts it silently and your downstream code crashes when `units` is missing.
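
One way to catch this at the boundary is to deserialize arguments into the concrete struct and reject unknown fields, instead of passing raw JSON straight through. A sketch using serde's `deny_unknown_fields`, mirroring the `WeatherArgs` struct above:

```rust
use serde::Deserialize;

// Rejecting unknown fields turns "close but wrong" arguments into a loud,
// retryable error instead of a silent downstream crash.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct WeatherArgs {
    location: String,
    units: Option<String>,
}

fn validate_weather_args(raw: &serde_json::Value) -> Result<WeatherArgs, String> {
    serde_json::from_value(raw.clone()).map_err(|e| format!("schema mismatch: {e}"))
}
```

Feeding `{"location": "Boston", "type": "celsius"}` through this returns an error naming the unknown field, which you can hand straight back to the model on retry.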

### Failure Pattern 4: The Non-Decider

Some models simply refuse to call tools. They acknowledge your tool definitions, then respond with natural language instead:

"I don't have access to real-time weather data, but I can tell you that Boston in March is typically..."

The tool definitions were ignored. This is especially common with smaller models (under 8B parameters) that weren't trained on tool-use data.
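
No amount of output parsing recovers from this one. The practical workaround is to detect that no JSON object came back at all and re-prompt with a more constrained instruction. A minimal detection sketch; the follow-up prompt wording is purely illustrative:

```rust
// The model answered in prose: there is no JSON object anywhere in the response.
fn declined_to_call_tool(response: &str) -> bool {
    !response.contains('{')
}

// Hypothetical stricter re-prompt appended on retry. The exact wording is an
// assumption; the point is to close the "answer in prose" escape hatch.
fn stricter_reprompt(original: &str) -> String {
    format!(
        "{original}\n\nRespond ONLY with a JSON object of the form \
         {{\"tool\": \"...\", \"arguments\": {{...}}}}. Do not answer in prose."
    )
}
```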

## The Retry Loop That Actually Works

Here's the thing: these failures are recoverable. You just need a retry strategy that handles each failure mode. Below is a Rust implementation:

```rust
// Dependencies: serde (with derive), serde_json, tokio, and regex.
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ToolCall {
    pub tool: String,
    pub arguments: serde_json::Value,
}

#[derive(Debug)]
pub enum ToolCallError {
    NoToolCalled,
    InvalidJson(String),
    SchemaMismatch(String),
    ParseError(String),
}

pub struct ToolCallRetryConfig {
    pub max_retries: u32,
    pub base_delay_ms: u64,
    pub max_delay_ms: u64,
}

impl Default for ToolCallRetryConfig {
    fn default() -> Self {
        Self {
            max_retries: 3,
            base_delay_ms: 100,
            max_delay_ms: 2000,
        }
    }
}

pub async fn call_with_retry<F, T>(
    prompt: &str,
    // Tool definitions are assumed to already be rendered into `prompt` by the caller.
    _tools: &[serde_json::Value],
    mut executor: F,
    config: ToolCallRetryConfig,
) -> Result<ToolCall, ToolCallError>
where
    F: FnMut(String) -> T,
    T: std::future::Future<Output = Result<String, String>>,
{
    let mut last_error = None;
    let mut delay = config.base_delay_ms;

    for attempt in 0..config.max_retries {
        // Executor (transport/model) errors abort immediately; parsing failures are retried below.
        let response = executor(prompt.to_string()).await
            .map_err(ToolCallError::ParseError)?;
        
        // Attempt 1: Try direct JSON parse
        match parse_tool_call(&response) {
            Ok(call) => return Ok(call),
            Err(e) => last_error = Some(e),
        }

        // Attempt 2: Strip markdown fences
        let stripped = strip_markdown_json(&response);
        if let Ok(call) = parse_tool_call(&stripped) {
            return Ok(call);
        }

        // Attempt 3: Extract JSON from anywhere in the response
        let extracted = extract_json_anywhere(&response);
        if let Ok(call) = parse_tool_call(&extracted) {
            return Ok(call);
        }

        // Retry with backoff
        if attempt < config.max_retries - 1 {
            tokio::time::sleep(tokio::time::Duration::from_millis(delay)).await;
            delay = (delay * 2).min(config.max_delay_ms);
        }
    }

    Err(last_error.unwrap_or(ToolCallError::NoToolCalled))
}

fn parse_tool_call(response: &str) -> Result<ToolCall, ToolCallError> {
    let trimmed = response.trim();
    
    // Try direct parse first
    if let Ok(call) = serde_json::from_str::<ToolCall>(trimmed) {
        return Ok(call);
    }

    Err(ToolCallError::InvalidJson(response.to_string()))
}

fn strip_markdown_json(input: &str) -> String {
    // Remove ```json and ``` fences
    let re = regex::Regex::new(r"(?s)```json\s*(.*?)\s*```").unwrap();
    let result = re.replace_all(input, "$1");
    
    // Also handle plain ``` fences
    let re2 = regex::Regex::new(r"(?s)```\s*(.*?)\s*```").unwrap();
    re2.replace_all(&result, "$1").to_string()
}

fn extract_json_anywhere(input: &str) -> String {
    // Find the first { and last } and extract everything between
    if let (Some(start), Some(end)) = (input.find('{'), input.rfind('}')) {
        input[start..=end].to_string()
    } else {
        input.to_string()
    }
}
```
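
To show the wiring, here's a usage sketch with a stubbed executor. In a real agent the closure would call your local server (llama.cpp, Ollama, or similar) and return the raw completion text; the stub below just simulates a fence-wrapped response:

```rust
#[tokio::main]
async fn main() {
    // Stub executor: pretends the local model smuggled its JSON inside markdown fences.
    let executor = |_prompt: String| async move {
        Ok::<String, String>(
            "```json\n{\"tool\": \"get_weather\", \"arguments\": {\"location\": \"Boston\"}}\n```"
                .to_string(),
        )
    };

    let result = call_with_retry(
        "What's the weather in Boston?",
        &[], // tool definitions are assumed to be rendered into the prompt already
        executor,
        ToolCallRetryConfig::default(),
    )
    .await;

    // The markdown stripper recovers the call: ToolCall { tool: "get_weather", ... }
    println!("{result:?}");
}
```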

## Beyond Retries: Structured Tool Descriptions

The retry loop helps, but the real fix is making your tools easier for the model to use correctly.

### Be Explicit About Types

Instead of:

```json
{
  "name": "get_weather",
  "description": "Get weather info"
}
```

Use:

```json
{
  "name": "get_weather",
  "description": "Returns current weather for a location. Always call this when the user asks about weather.",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g. 'Boston' or 'London'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature unit, defaults to celsius"
      }
    },
    "required": ["location"]
  }
}
```
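
If you're building these definitions in Rust to pass as the tool-definitions slice in `call_with_retry` above, serde_json's `json!` macro keeps them readable. A sketch of the same definition:

```rust
use serde_json::{json, Value};

// The get_weather definition from above, built programmatically.
fn weather_tool_definition() -> Value {
    json!({
        "name": "get_weather",
        "description": "Returns current weather for a location. Always call this when the user asks about weather.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g. 'Boston' or 'London'"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit, defaults to celsius"
                }
            },
            "required": ["location"]
        }
    })
}
```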

### Use Enums for Ambiguous Fields

If a field has a limited set of valid values, use an `enum`:

"status": {
  "type": "string",
  "enum": ["pending", "active", "completed", "failed"],
  "description": "One of: pending, active, completed, failed"
}

This prevents the model from inventing values like "in progress" or "running".

### Add Examples in the Description

"description": "Get weather for a location. 
Example: get_weather(location='Boston', units='fahrenheit')"

## What Actually Works

After building agents with local LLMs for months, here's what I've learned:

  1. Smaller models (≤8B) struggle with tool selection. They often skip tools entirely. Use 14B+ models for reliable tool use.

  2. Qwen3.5 is good but has thinking token issues. If using O1-style modes, implement the JSON extraction layer above.

  3. Llama 3.3's function calling is improving via llama.cpp's new tool-calling support, but you still need the markdown stripper.

  4. Always validate schema after parsing. The model might output valid JSON that's still wrong for your tool.

  5. Temperature matters. For tool calling, keep temperature at 0.1-0.3. Higher temperatures increase creative (wrong) outputs (see the sketch after this list).
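
On that last point, here's a hedged sketch of where the setting lives when you talk to llama.cpp's `llama-server` through its OpenAI-compatible endpoint. Port 8080 is the server's default and the model name is a placeholder (a single-model server ignores it); assumes reqwest with the `json` feature and tokio:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Keep temperature low for tool calling; higher values invite invented fields.
    let body = json!({
        "model": "local-model",
        "temperature": 0.2,
        "messages": [
            {"role": "user", "content": "What's the weather in Boston?"}
        ]
    });

    let raw = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    // Hand `raw` to the parsing/retry pipeline above.
    println!("{raw}");
    Ok(())
}
```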

## The Bottom Line

Local LLMs aren't broken—they're just less polished. The same failure patterns that killed early chatbot demos (JSON parsing, schema guessing, output formatting) are alive and well in tool calling.

The good news: these are engineering problems, not fundamental limitations. With the right retry logic, validation layers, and tool descriptions, local models can absolutely drive agents reliably.

Your agent doesn't need GPT-4. It needs better JSON handling.