How can I reduce LLM API costs in production?

The six most effective techniques: prompt caching (for large, repeated system prompts), model routing (smaller models for simple tasks), semantic caching (reuse responses for similar queries), output length control, batch processing for offline workloads, and prompt compression.

What is prompt caching and how does it reduce costs?

Prompt caching stores previously computed representations of large, static prompt sections (like system prompts or reference documents) and reuses them across calls. Providers like Anthropic and OpenAI offer this at significantly reduced prices for cached tokens — typically 80-90% cheaper than uncached.

When should I use GPT-4o-mini instead of GPT-4o?

For classification tasks, short-form generation, simple extraction from structured data, and any task where you have measured that quality difference is acceptable. Always benchmark on your specific task rather than assuming the smaller model is sufficient.

How does model routing reduce LLM costs?

A classifier determines the complexity of each query and routes it to the most cost-appropriate model. Simple queries go to small models (low cost), complex queries go to frontier models (higher cost). Effective routing can reduce average cost per call by 30-50% while maintaining overall quality.

What is the cheapest way to run LLMs in production?

The cheapest production approach: fine-tuned small open-weight models self-hosted on cloud GPUs, with semantic caching to avoid redundant calls, and batch processing for offline workloads. This requires infrastructure investment but produces the lowest per-call cost at scale.

Cut LLM API Costs 60%: Production Techniques

LLM API costs have a pattern: they start small and feel manageable, then scale faster than the revenue they generate. A proof-of-concept that costs Rs.5,000 per month becomes a production workload that costs Rs.80,000 per month after user acquisition — sometimes before the product is profitable enough to absorb that cost. Running Marketing Autopilot, XwFin, and several other AI-heavy products at Xwits pushed us to take cost optimisation seriously. Here is what actually moved the number.

Why LLM Costs Spiral (And the Mental Model to Fix Them)

Most LLM cost problems come from three root causes: using powerful models for tasks that do not require them, sending the same context repeatedly without caching, and generating more output tokens than necessary. Every cost optimisation technique is an attack on one of these three problems. Know which problem you have before choosing a technique.

The Six Techniques That Actually Work

1. Prompt caching — If your system prompt is large (common with RAG setups or detailed instructions), prompt caching can reduce costs by 50-90% on the system prompt portion. Anthropic and OpenAI both offer caching. Cache your system prompt and any large static context that stays the same across many calls. This is the highest-ROI change for most applications.

2. Model routing — Not every query needs your most powerful model. A router that classifies queries by complexity and routes simple ones to a smaller model (GPT-4o-mini, Gemma, Mistral Small) and complex ones to the frontier model cuts costs significantly. We reduced average cost per call by 40% on Marketing Autopilot by routing classification and short-form generation tasks to smaller models. The key: measure quality drop, not just cost reduction. Some tasks tolerate quality reduction; others do not.

3. Semantic caching — Cache LLM responses and return cached results for semantically similar queries. If ten users ask essentially the same question about a product, you should compute the answer once, not ten times. Semantic caching requires a vector store to find similar past queries, but the cost reduction on high-traffic, question-answering workloads is substantial.

4. Output length control — LLM APIs charge for output tokens, not just input. Explicit instructions to keep responses concise, combined with max-token limits set below the model default, reduce costs on verbose models. We added explicit length instructions to 80% of our production prompts and reduced average output tokens by 30% with no material quality impact.

5. Batching — For offline or non-real-time workloads, batch API calls rather than making individual requests. Most providers offer batch APIs at 50% of standard pricing. Marketing Autopilot's content generation pipeline moved to batch processing for non-time-sensitive content and halved that workload's cost immediately.

6. Prompt compression — Long prompts cost money. Compress verbose prompts by removing redundant instructions, using concise phrasing, and moving example-heavy few-shot prompts to fine-tuned models where call volume justifies it. A 30% reduction in input tokens on a high-volume pipeline is meaningful at scale.

How to Track and Audit LLM Spend

You cannot optimise what you do not measure. Instrument every LLM call with: model name, input token count, output token count, latency, feature or endpoint tag, and estimated cost. Build a dashboard that shows cost by feature. The first time you see cost attribution by feature, you will immediately identify which ones have cost structures that do not match their revenue contribution. That identification is the starting point for targeted optimisation.

How to Cut Your LLM API Bill by 60%: Techniques That Actually Work

Why LLM Costs Spiral (And the Mental Model to Fix Them)

The Six Techniques That Actually Work

How to Track and Audit LLM Spend

Frequently Asked Questions

Related Posts

Small Models, Big Wins: When Phi-4 or Gemma Beats GPT-4 in Your Stack

The Gen-AI Stack I Use in Every Production Project