9 min read

    How to Cut Your LLM API Bill by 60%: Techniques That Actually Work

    by Deep Parmar

    CTO, Sunbots & Xwits

    Cut LLM API Costs 60%: Production Techniques | Deep Parmar

    LLM API costs have a pattern: they start small and feel manageable, then scale faster than the revenue they generate. A proof-of-concept that costs Rs.5,000 per month becomes a production workload that costs Rs.80,000 per month after user acquisition — sometimes before the product is profitable enough to absorb that cost. Running Marketing Autopilot, XwFin, and several other AI-heavy products at Xwits pushed us to take cost optimisation seriously. Here is what actually moved the number.

    Why LLM Costs Spiral (And the Mental Model to Fix Them)

    Most LLM cost problems come from three root causes: using powerful models for tasks that do not require them, sending the same context repeatedly without caching, and generating more output tokens than necessary. Every cost optimisation technique is an attack on one of these three problems. Know which problem you have before choosing a technique.

    The Six Techniques That Actually Work

    1. Prompt caching — If your system prompt is large (common with RAG setups or detailed instructions), prompt caching can reduce costs by 50-90% on the system prompt portion. Anthropic and OpenAI both offer caching. Cache your system prompt and any large static context that stays the same across many calls. This is the highest-ROI change for most applications.

    2. Model routing — Not every query needs your most powerful model. A router that classifies queries by complexity and routes simple ones to a smaller model (GPT-4o-mini, Gemma, Mistral Small) and complex ones to the frontier model cuts costs significantly. We reduced average cost per call by 40% on Marketing Autopilot by routing classification and short-form generation tasks to smaller models. The key: measure quality drop, not just cost reduction. Some tasks tolerate quality reduction; others do not.

    3. Semantic caching — Cache LLM responses and return cached results for semantically similar queries. If ten users ask essentially the same question about a product, you should compute the answer once, not ten times. Semantic caching requires a vector store to find similar past queries, but the cost reduction on high-traffic, question-answering workloads is substantial.

    4. Output length control — LLM APIs charge for output tokens, not just input. Explicit instructions to keep responses concise, combined with max-token limits set below the model default, reduce costs on verbose models. We added explicit length instructions to 80% of our production prompts and reduced average output tokens by 30% with no material quality impact.

    5. Batching — For offline or non-real-time workloads, batch API calls rather than making individual requests. Most providers offer batch APIs at 50% of standard pricing. Marketing Autopilot's content generation pipeline moved to batch processing for non-time-sensitive content and halved that workload's cost immediately.

    6. Prompt compression — Long prompts cost money. Compress verbose prompts by removing redundant instructions, using concise phrasing, and moving example-heavy few-shot prompts to fine-tuned models where call volume justifies it. A 30% reduction in input tokens on a high-volume pipeline is meaningful at scale.

    How to Track and Audit LLM Spend

    You cannot optimise what you do not measure. Instrument every LLM call with: model name, input token count, output token count, latency, feature or endpoint tag, and estimated cost. Build a dashboard that shows cost by feature. The first time you see cost attribution by feature, you will immediately identify which ones have cost structures that do not match their revenue contribution. That identification is the starting point for targeted optimisation.

    Frequently Asked Questions

    Quick answers about this topic — also indexed by AI search engines via FAQPage schema.

    Share this article: