The Economics of API Calls
Every line of AI code costs money. Not in infrastructure — in tokens. Here's how to think about it.
The Numbers
| Provider/Model         | Input ($/M) | Output ($/M) | Context Window |
|------------------------|-------------|--------------|----------------|
| MiniMax M2.5           | $0.30       | $1.20        | 1M tokens      |
| MiniMax M2.5-highspeed | $0.60       | $2.40        | 1M tokens      |
| GPT-4o mini            | Cheap       | Cheap        | 128K           |
| Claude 3.5             | Premium     | Premium      | 200K           |
| Mistral (Anyscale)     | $0.15       | $0.15        | 128K           |
Output tokens almost always cost more than input tokens; generation is the expensive half of inference. That's not going to change.
The Math
A typical agentic turn — system prompt, tool definitions, conversation history, tool output, response — easily burns 3,000-10,000 tokens.
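You don't have to guess at the size of a turn. A quick sketch using tiktoken (assuming an OpenAI-style o200k_base tokenizer; the strings are placeholders for your real prompt parts, and other providers' tokenizers will count somewhat differently):

```python
# pip install tiktoken
import tiktoken

# o200k_base is the encoding behind recent OpenAI models; other providers
# ship their own tokenizers, but counts land in the same ballpark.
enc = tiktoken.get_encoding("o200k_base")

parts = {
    "system prompt": "You are a helpful assistant with access to tools...",
    "tool definitions": '{"name": "search", "description": "Search the web"}',
    "history": "user: ...\nassistant: ...\nuser: ...",
}

for name, text in parts.items():
    print(f"{name}: {len(enc.encode(text))} tokens")
```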
Say 5,000 tokens per turn at MiniMax M2.5 rates:
- Input: 4,000 × $0.30/M = $0.0012
- Output: 1,000 × $1.20/M = $0.0012
$0.0024 per turn. Sounds tiny. But a 50-turn conversation? $0.12. A thousand conversations a day? $120 a day, roughly $3,600 a month.
Scale is the killer.
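The arithmetic is worth scripting so you can plug in your own traffic. A back-of-the-envelope sketch using the numbers above:

```python
def monthly_cost(
    input_tokens_per_turn: int = 4_000,
    output_tokens_per_turn: int = 1_000,
    input_price_per_m: float = 0.30,   # MiniMax M2.5 rates from the table
    output_price_per_m: float = 1.20,
    turns_per_conversation: int = 50,
    conversations_per_day: int = 1_000,
) -> float:
    per_turn = (
        input_tokens_per_turn * input_price_per_m
        + output_tokens_per_turn * output_price_per_m
    ) / 1_000_000
    return per_turn * turns_per_conversation * conversations_per_day * 30

print(f"${monthly_cost():,.0f}/month")  # -> $3,600/month
```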
The Leverage Points
1. Model Routing
Not every task needs GPT-4o. Simple classification, extraction, formatting — small models handle these for 10-20x less.
Route by task complexity (a minimal router sketch follows the list):
- Small (cheap): classification, formatting, simple extraction
- Medium: summarization, rewriting, Q&A over docs
- Large (expensive): reasoning chains, planning, complex tool orchestration
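The router itself can be trivial: a lookup table from task type to model tier. A sketch, where the task labels and model names are illustrative placeholders, not recommendations:

```python
# Task labels and model names are illustrative placeholders.
ROUTES = {
    "classify":    "small-model",
    "format":      "small-model",
    "extract":     "small-model",
    "summarize":   "medium-model",
    "rewrite":     "medium-model",
    "doc_qa":      "medium-model",
    "plan":        "large-model",
    "orchestrate": "large-model",
}

def pick_model(task_type: str) -> str:
    # Unknown tasks default to the medium tier: cheap enough to try,
    # capable enough not to silently fail.
    return ROUTES.get(task_type, "medium-model")
```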
2. Context Window Management
Every token in context costs money. Strategies (the first and third are sketched after the list):
- Truncate aggressively. Keep only the last N messages or summarize older ones.
- Separate systems. Don't paste tool definitions into the message history itself; pass them once through the API's dedicated tools field, and use prompt caching where the provider offers it.
- Tool output pruning. The model doesn't need full JSON back from every tool. Extract just what matters.
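A sketch of the first and third strategies; the message shape and the tool-output fields are hypothetical stand-ins for your own:

```python
def truncate_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    # Keep the system message, drop everything but the most recent turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

def prune_search_output(raw: dict) -> list[dict]:
    # Hypothetical search tool: the full response is a large JSON blob,
    # but the model only needs titles and short snippets.
    return [
        {"title": r["title"], "snippet": r["snippet"][:200]}
        for r in raw.get("results", [])[:5]
    ]
```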
3. Caching
If you're calling the same prompts repeatedly, cache the results. Redis, SQLite, in-memory — doesn't matter. Hit the API less, save more.
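A minimal in-memory sketch; `call_model` is a stand-in for your real API client, and the dict can be swapped for Redis or SQLite behind the same function:

```python
import hashlib
import json

_cache: dict[str, str] = {}

def cached_call(model: str, prompt: str, call_model) -> str:
    # Key on everything that affects the output.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # only hit the API on a miss
    return _cache[key]
```

Note this only pays off when identical prompts recur and you're sampling deterministically (temperature 0); caching a creative endpoint changes its behavior.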
4. Go Local
llama.cpp runs 7B-14B models on consumer hardware. For simple tasks, local inference is free after hardware cost.
Tradeoff: latency. Local is slower, but for background jobs it doesn't matter.
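With the llama-cpp-python bindings, the whole setup is a few lines. A sketch, assuming you've already downloaded a GGUF model file (the path and model are placeholders):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Model path is a placeholder; any instruct-tuned GGUF works.
llm = Llama(model_path="./models/7b-instruct-q4_k_m.gguf", n_ctx=4096)

out = llm.create_completion(
    "Classify the sentiment of: 'the update broke my workflow'\nSentiment:",
    max_tokens=8,
    temperature=0.0,
)
print(out["choices"][0]["text"].strip())
```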
What This Means for Agents
The agent pattern — loop of thinking, acting, observing — is token-intensive by design. Every iteration adds context, and context = cost.
Three ways to survive (a loop-guard sketch follows the list):
- Keep loops short. Max 3-5 iterations. Give up if it hasn't worked.
- Fail fast. If the model can't solve it in 2 tries, escalate or return partial results.
- Route ruthlessly. Small tasks to small models. Save the big model for when reasoning actually matters.
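Together, these rules are just guards on the loop. A minimal sketch; `step` and `escalate` are placeholders for your agent's internals:

```python
MAX_ITERATIONS = 5   # keep loops short
MAX_FAILURES = 2     # fail fast

def run_agent(task: str, step, escalate) -> str:
    """step() runs one think/act/observe turn; both callables are placeholders."""
    context, failures = [task], 0
    for _ in range(MAX_ITERATIONS):
        done, observation = step(context)      # one iteration of the agent loop
        if done:
            return observation
        if observation is None:                # the step produced nothing useful
            failures += 1
            if failures >= MAX_FAILURES:
                break                          # stop burning tokens; escalate
        else:
            context.append(observation)        # context grows, and so does cost
    return escalate(task, context)             # partial results beat a runaway loop
```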
The Bottom Line
The real cost of AI isn't the model — it's the context you build up around it. Every conversation, every tool definition, every retry. Be intentional about what stays in the prompt and what gets dropped.
The difference between a $500/month agent and a $5,000/month agent is often just 3x fewer tokens, not a better model.