LLM Cost Optimization: Cut Your AI API Bills 10x

LLM API costs can spiral fast. Here are the techniques I use to keep costs under control across six AI products — from model tiering to prompt caching to batching.

#llm #cost #ai #optimization

LLM API costs are one of the top concerns for AI product builders. At scale, naive usage patterns cost 10-100× more than necessary. Here are the techniques I use across my products to keep costs reasonable without sacrificing quality.

The baseline problem

A naive AI product calls the most capable model for every request, passes the full conversation history every time, and makes synchronous calls one-by-one. This is:

  • Expensive: Claude Sonnet at $3/1M input + $15/1M output is 12× more expensive than Haiku at $0.25/$1.25
  • Slow: sequential calls stack latency
  • Wasteful: most requests don't need the most capable model (in BikroyBuddy, 75% are handled by Haiku)

Technique 1: Model tiering

Route requests to the cheapest model that can handle them:

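// TaskType labels the kind of work a request needs.
type TaskType int

const (
    IntentClassify TaskType = iota
    SimpleAnswer
    Summarize
    CodeGeneration
    ComplexReasoning
    LongForm
    ArchitectureDesign
    NuancedJudgment
)

// Task is a single request to route (minimal shape, assumed here).
type Task struct {
    Type TaskType
}

// ModelRouter maps task types to the cheapest tier that lists them.
type ModelRouter struct {
    tiers []ModelTier
}

type ModelTier struct {
    Model     string
    MaxTokens int
    UseFor    []TaskType
}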

var DefaultTiers = []ModelTier{
    {
        Model:  "claude-haiku-4-5",
        UseFor: []TaskType{IntentClassify, SimpleAnswer, Summarize},
    },
    {
        Model:  "claude-sonnet-4-6",
        UseFor: []TaskType{CodeGeneration, ComplexReasoning, LongForm},
    },
    {
        Model:  "claude-opus-4-8",
        UseFor: []TaskType{ArchitectureDesign, NuancedJudgment},
    },
}

func (r *ModelRouter) Route(task Task) string {
    for _, tier := range r.tiers {
        for _, taskType := range tier.UseFor {
            if task.Type == taskType {
                return tier.Model
            }
        }
    }
    return "claude-haiku-4-5" // default to cheapest
}

In BikroyBuddy: 75% of requests go to Haiku (intent classification, simple replies), 25% to Sonnet (negotiation). This alone reduces costs by ~70% vs all-Sonnet.
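The ~70% figure can be sanity-checked with a rough blended-cost calculation. This is only a sketch: it assumes every request uses a similar number of tokens regardless of which model handles it, and `blended_cost_ratio` is an illustrative helper name, not part of any library.

```python
def blended_cost_ratio(mix: dict, prices: dict, baseline: str) -> float:
    """Cost of a traffic mix relative to sending everything to `baseline`.

    mix maps model -> share of requests (shares should sum to 1.0).
    prices maps model -> price per 1M tokens.
    Assumes equal token counts per request across models.
    """
    blended = sum(share * prices[model] for model, share in mix.items())
    return blended / prices[baseline]


prices = {"haiku": 0.25, "sonnet": 3.00}  # $/1M input tokens, as quoted in this post
mix = {"haiku": 0.75, "sonnet": 0.25}     # BikroyBuddy traffic split

ratio = blended_cost_ratio(mix, prices, baseline="sonnet")
print(f"{1 - ratio:.0%} cheaper than all-Sonnet")
```

Because the output prices have the same 12× ratio, the result is the same for output tokens. In practice Sonnet requests tend to be longer than Haiku ones, so the real saving depends on the token split, not just the request split.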

Technique 2: Prompt caching

Claude, OpenAI, and most major providers support prompt caching: a repeated prompt prefix such as the system prompt is cached server-side, and later requests that reuse it are billed for those tokens at a heavily discounted rate instead of the full input price.

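# Claude API with prompt caching
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_SYSTEM_PROMPT and user_message are defined elsewhere in the app
response = client.messages.create(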
    model="claude-sonnet-4-6",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 2000 tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)

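With caching, a request that misses the cache writes the 2000-token system prompt to it, which Anthropic bills at 1.25× the normal input price (for the default 5-minute cache). A request that hits the cache pays 10% of the normal input price for those tokens. For a chatbot with 100 messages/day and a 2000-token system prompt:

  • Without caching: 100 × 2000 × $3/1M = $0.60/day
  • With caching (90% hit rate): 90 × 2000 × $0.30/1M + 10 × 2000 × $3.75/1M ≈ $0.13/day

Savings: roughly 78% on system prompt tokens at a 90% hit rate, approaching 90% ($0.06/day) as the hit rate nears 100%. The default cache entry expires after about five minutes without use, so a high hit rate requires traffic that arrives steadily or in bursts.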
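A small calculator makes it easy to see how the hit rate changes the bill. This is a sketch under stated assumptions: cache misses are billed as cache writes at 1.25× the base input price and hits at 0.10× (Anthropic's multipliers for the 5-minute cache at the time of writing; check current pricing), and `system_prompt_cost` is a hypothetical helper name.

```python
def system_prompt_cost(requests, prompt_tokens, price_per_mtok,
                       hit_rate=None, write_mult=1.25, read_mult=0.10):
    """Daily USD cost of the system prompt tokens only.

    hit_rate=None means caching is disabled and every request pays full price.
    Otherwise hits are billed at read_mult and misses (cache writes) at write_mult.
    """
    base = requests * prompt_tokens * price_per_mtok / 1_000_000
    if hit_rate is None:
        return base
    return base * (hit_rate * read_mult + (1 - hit_rate) * write_mult)


no_cache = system_prompt_cost(100, 2000, 3.00)
cached_90 = system_prompt_cost(100, 2000, 3.00, hit_rate=0.9)
cached_100 = system_prompt_cost(100, 2000, 3.00, hit_rate=1.0)

print(f"no cache: ${no_cache:.3f}/day")
print(f"90% hits: ${cached_90:.3f}/day")
print(f"100% hits: ${cached_100:.3f}/day")
```

The same function shows the downside: at a low hit rate the 1.25× write premium can make caching slightly more expensive than not caching at all, so it is worth measuring the real hit rate before relying on the discount.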

Technique 3: Free models for non-critical paths

Via OpenRouter, several capable open-source models are available for free (rate-limited):

| Model | OpenRouter free? | Use for |
|-------|------------------|---------|
| Llama 3.1 70B | ✅ | Summarization, classification |
| Mistral 7B | ✅ | Simple Q&A, extraction |
| Gemma 2 9B | ✅ | Lightweight tasks |
| Claude Haiku | ❌ | Paid but very cheap |
| Claude Sonnet | ❌ | Paid, for complex work |

import os

from openai import OpenAI

OPENROUTER_KEY = os.environ["OPENROUTER_KEY"]  # env var name is an assumption

# Route to free model for batch processing
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_KEY
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct:free",
    messages=[{"role": "user", "content": classify_prompt}]
)

I use free models for: daily batch jobs, internal tools, non-customer-facing processing, and research tasks. Customer-facing features use paid models.
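The post does not say how rate limits on the free tier are handled. One possible approach, sketched here as an assumption rather than the author's method, is to try the free model first and fall back to a cheap paid model when the call fails. `complete_with_fallback` and the `call` callable are hypothetical; `call(model, prompt)` stands in for whatever client wrapper the application already has, and the model names are just labels.

```python
from typing import Callable, Optional, Sequence, Tuple


def complete_with_fallback(call: Callable[[str, str], str], prompt: str,
                           models: Sequence[str]) -> Tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call(model, prompt)
        except Exception as exc:  # e.g. a rate-limit error from the free tier
            last_error = exc
    raise RuntimeError("all models failed") from last_error


# Stand-in for a real API call: the free model is "rate limited" here.
def fake_call(model: str, prompt: str) -> str:
    if model.endswith(":free"):
        raise RuntimeError("429 rate limited")
    return f"[{model}] ok"


model_used, text = complete_with_fallback(
    fake_call,
    "Classify: red bicycle",
    ["meta-llama/llama-3.1-70b-instruct:free", "claude-haiku-4-5"],
)
print(model_used)
```

For batch jobs it may be cheaper to retry the free model after a delay rather than fall back to a paid one; which is better depends on how urgent the job is.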

Technique 4: Batching

Many LLM providers (including Anthropic) offer batch APIs with 50% discounts for non-real-time processing:

# Anthropic Batch API — process many messages at once
import time

import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"product_{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 100,
                "messages": [{"role": "user", "content": f"Classify: {product}"}]
            }
        }
        for i, product in enumerate(product_list)
    ]
)

# Poll until processing has ended (typically < 1 hour).
# Checking for "ended" also covers a batch that is being canceled.
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

results = client.messages.batches.results(batch.id)

50% discount for batch processing means nightly batch jobs (product classification, report generation, analytics) cost half of real-time API calls.
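Batch results come back one entry per request, and individual entries can fail even when the batch as a whole ends normally. A minimal sketch of separating the two cases follows; `collect_batch_results` is a hypothetical helper, and the demo uses `SimpleNamespace` stand-ins shaped like the SDK's result entries (`custom_id`, `result.type`, `result.message.content[0].text`) so it runs without an API key.

```python
from types import SimpleNamespace


def collect_batch_results(results):
    """Split batch results into {custom_id: text} and a list of failed custom_ids."""
    succeeded, failed = {}, []
    for entry in results:
        if entry.result.type == "succeeded":
            # The first content block of the returned message holds the text reply
            succeeded[entry.custom_id] = entry.result.message.content[0].text
        else:  # "errored", "canceled", or "expired"
            failed.append(entry.custom_id)
    return succeeded, failed


# Fake entries for illustration only; real ones come from the batch results call.
fake_results = [
    SimpleNamespace(
        custom_id="product_0",
        result=SimpleNamespace(
            type="succeeded",
            message=SimpleNamespace(content=[SimpleNamespace(text="Electronics")]),
        ),
    ),
    SimpleNamespace(custom_id="product_1", result=SimpleNamespace(type="errored")),
]

labels, failed_ids = collect_batch_results(fake_results)
print(labels, failed_ids)
```

With a real batch, the same function would be called as `collect_batch_results(client.messages.batches.results(batch.id))`, and the failed IDs can be resubmitted in a follow-up batch.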

Technique 5: Context window management

Long conversation histories are expensive — every message in history is billed as input tokens. Strategies:

Sliding window: keep only the last N messages:

def trim_history(history: list, max_messages: int = 10) -> list:
    if len(history) > max_messages:
        # Keep system context + last N messages
        return history[:1] + history[-max_messages:]
    return history

Summarization: periodically summarize old messages:

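async def summarize_old_context(history: list, threshold: int = 20) -> list:
    if len(history) > threshold:
        cutoff = threshold // 2
        old_messages = history[1:cutoff]  # skip system message at index 0
        # Render as "role: content" lines; `llm.complete` is a placeholder for your async client
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        summary = await llm.complete(f"Summarize these messages in 100 words:\n{transcript}")
        summary_msg = {"role": "assistant", "content": f"[Previous summary]: {summary}"}
        return [history[0], summary_msg] + history[cutoff:]
    return history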

Cost dashboard

I track per-product LLM costs in a simple PostgreSQL table:

CREATE TABLE llm_cost_log (
    id          BIGSERIAL PRIMARY KEY,
    product     TEXT,
    model       TEXT,
    task_type   TEXT,
    input_tokens  INT,
    output_tokens INT,
    cost_usd    NUMERIC(10, 6),
    created_at  TIMESTAMPTZ DEFAULT now()
);

A daily aggregate query shows where costs are concentrated. In BikroyBuddy, this revealed that 15% of negotiation sessions were consuming 60% of Sonnet spend, and those sessions came from users who never converted. I added a "low intent" classifier that routes obvious non-buyers to Haiku instead.
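The logging and aggregation steps might look roughly like this. It is a sketch, not the production code: an in-memory SQLite database stands in for PostgreSQL so the example runs on its own, the price table uses the per-token prices quoted in this post, the helper names `cost_usd` and `log_call` are hypothetical, and the token counts are made up for illustration.

```python
import sqlite3

# $/1M tokens as (input, output), using the prices quoted in this post
PRICES = {"claude-haiku-4-5": (0.25, 1.25), "claude-sonnet-4-6": (3.00, 15.00)}


def cost_usd(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.execute("""CREATE TABLE llm_cost_log (
    product TEXT, model TEXT, task_type TEXT,
    input_tokens INT, output_tokens INT, cost_usd REAL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP)""")


def log_call(product, model, task_type, input_tokens, output_tokens):
    conn.execute(
        "INSERT INTO llm_cost_log "
        "(product, model, task_type, input_tokens, output_tokens, cost_usd) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (product, model, task_type, input_tokens, output_tokens,
         cost_usd(model, input_tokens, output_tokens)),
    )


# Made-up token counts, for illustration only
log_call("BikroyBuddy", "claude-haiku-4-5", "intent_classify", 800, 20)
log_call("BikroyBuddy", "claude-sonnet-4-6", "negotiation", 4000, 600)
log_call("BikroyBuddy", "claude-sonnet-4-6", "negotiation", 6000, 900)

# Daily spend per product, model, and task type, most expensive first
rows = conn.execute("""
    SELECT date(created_at) AS day, product, model, task_type,
           COUNT(*) AS calls, ROUND(SUM(cost_usd), 6) AS total_usd
    FROM llm_cost_log
    GROUP BY day, product, model, task_type
    ORDER BY total_usd DESC""").fetchall()
for row in rows:
    print(row)
```

Grouping by `task_type` as well as `model` is what makes a pattern like the negotiation-session spend visible; adding a session or user ID column would allow the same breakdown per session.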

FAQ

What is LLM prompt caching? Prompt caching stores frequently used prompt prefixes (like system prompts) server-side so they don't need to be re-processed on every request. Claude and OpenAI both support it. On a cache hit, the cached input tokens are charged at a discount (10% of the normal input price on Claude; the discount varies by provider).

How much cheaper is Claude Haiku than Claude Sonnet? Haiku 4.5 costs $0.25/1M input, $1.25/1M output. Sonnet 4.6 costs $3/1M input, $15/1M output. Sonnet is 12× more expensive for both input and output. Use Haiku for any task where the quality difference isn't user-visible.

What's the Anthropic Batch API? The Anthropic Batch API processes large numbers of requests asynchronously with a 50% discount vs real-time API calls. Results are available within 24 hours (typically much faster). Ideal for nightly processing jobs, data enrichment, and any non-real-time AI task.

Should I self-host an open-source LLM to save money? Self-hosting pays off above ~50M tokens/day for 8B models. Below that, hosted APIs (especially OpenRouter free tier) are cheaper after factoring in GPU instance costs and operational overhead.


Written by Shihab Shahriar Antor — AI Engineer & Founder of Shahriar Labs. See also: Self-Hosting LLMs: Llama 3, Mistral on Your Server · Deploy Always-On AI Agents on AWS for ~$17/mo.