Technical · 2026-06-27 · Last verified 2026-06-27

Reduce AI Agent LLM Costs: The Production Optimization Playbook

AI agents cost 5-20x more per interaction than a plain chatbot. This playbook covers the seven cost levers that actually work in production: prompt caching, model routing, context trimming, token caps with structured outputs, batching, semantic caching, and cheaper providers or self-hosting - with code and per-request math.

Deep · ML Architect & Full Stack Engineer

10+ years shipping production ML across TensorFlow, PyTorch, AWS, and GCP. Ships every A8gent agent before it becomes a lesson.

Key takeaways

Agents cost 5-20x more per interaction than a single-shot chatbot because every tool call re-sends the growing context. A 10-step agent loop can burn 100k+ input tokens to produce 3k output tokens.
Prompt caching is the highest-ROI lever: cached input tokens cost 0.1x the base rate on Anthropic (90% off) and 75-90% off on OpenAI. For agents, where 80-95% of each request is repeated prefix, this alone often cuts the bill 60-75%.
Model routing - sending easy requests to a $1/M model like Haiku 4.5 or GPT-5.4 Mini and escalating hard ones to a frontier model - typically halves costs again, because 60-80% of production agent traffic is easy.
Context hygiene compounds every other saving: trim tool outputs, summarize old turns, and cap max_tokens. An agent whose context grows unboundedly gets more expensive with every step of every conversation.
Measure per-request cost before optimizing anything. Log input tokens, output tokens, cache hits, and model per LLM call, and compute dollars per conversation. You cannot rank levers without this telemetry.
Batch APIs give a flat 50% discount on anything that does not need a real-time answer - evals, summarization jobs, enrichment pipelines - and the discount stacks with prompt caching.

Why Agents Cost 5-20x More Than a Chatbot

A chatbot answers one message with one LLM call. An agent runs a loop: think, call a tool, read the result, think again, maybe call another tool, then answer. Every iteration of that loop re-sends the entire conversation so far - system prompt, tool definitions, all previous messages, and every tool result. Each call's input grows roughly linearly with the number of steps, so the cumulative input tokens for a conversation grow roughly quadratically, and input tokens are what you pay for over and over.

Here is a worked example. A support agent with a 3,000-token system prompt (instructions plus 12 tool definitions), handling a conversation that takes 6 LLM calls and 5 tool invocations, with tool results averaging 1,200 tokens each:

LLM call	Input tokens	Output tokens	Cost @ $3/$15 per M (Sonnet-class)
1 (user question)	3,400	150	$0.0124
2 (after tool 1)	4,750	180	$0.0170
3 (after tool 2)	6,130	160	$0.0208
4 (after tool 3)	7,490	200	$0.0255
5 (after tool 4)	8,890	170	$0.0292
6 (final answer, after tool 5)	10,260	450	$0.0375
Total	40,920	1,310	$0.142

Notice the ratio: 41k input tokens to produce 1.3k output tokens. A plain chatbot answering the same question in one call would use maybe 3,500 input tokens - the agent used 12x more, and this was a well-behaved conversation. Add one retry because the model produced malformed JSON, one oversized tool result (a raw API response dumped as 8,000 tokens), and a second user turn on the same thread, and you are at $0.35-0.50 per conversation without anything going visibly wrong.
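The table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes each call's input is the previous input plus the previous output plus one 1,200-token tool result, which is the recurrence the table's numbers follow; the constant and function names are illustrative.

```python
# Reproduce the worked-example table: each call re-sends the whole context.
SYSTEM_AND_QUESTION = 3_400        # call 1 input: system prompt, tools, user question
TOOL_RESULT_TOKENS = 1_200         # average tool result appended between calls
OUTPUT_TOKENS = [150, 180, 160, 200, 170, 450]
PRICE_IN, PRICE_OUT = 3.00, 15.00  # USD per million tokens (Sonnet-class)

def simulate(outputs):
    """Return one (input_tokens, output_tokens, cost_usd) row per LLM call."""
    rows, context = [], SYSTEM_AND_QUESTION
    for out in outputs:
        cost = (context * PRICE_IN + out * PRICE_OUT) / 1_000_000
        rows.append((context, out, cost))
        # The next call carries this call's output plus one tool result.
        context += out + TOOL_RESULT_TOKENS
    return rows

rows = simulate(OUTPUT_TOKENS)
total_in = sum(r[0] for r in rows)
total_out = sum(r[1] for r in rows)
total_cost = sum(r[2] for r in rows)

for i, (inp, out, cost) in enumerate(rows, start=1):
    print(f"call {i}: {inp:>6,} in  {out:>4} out  ${cost:.4f}")
print(f"total : {total_in:,} in  {total_out:,} out  ${total_cost:.3f}")
```

Swap in your own prompt size, tool-result size, and call count to see how quickly the total grows before you have real telemetry.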

At 10,000 conversations a month that is $3,500-5,000. The good news: agent workloads are also the most optimizable LLM workloads, precisely because so much of each request is repeated content. The same structure that makes agents expensive makes prompt caching, routing, and trimming extremely effective. Teams that apply the levers in this post routinely land at 10-25% of their starting cost with no measurable quality loss. If you want to sanity-check what those savings mean for your business case, run the numbers through our agent ROI calculator.

Measure Before You Optimize

Every cost optimization effort should start with one week of per-request telemetry. Not a monthly invoice - the invoice tells you the total, not which agent, which step, or which tool is burning the money. You need four numbers on every single LLM call: input tokens, output tokens, cache-read tokens, and the model used. Every provider returns these in the API response - you just have to log them.

import time, json, logging
from dataclasses import dataclass, field

# Price table per million tokens (verify against current provider pricing)
PRICES = {
    "claude-sonnet-4-6":  {"in": 3.00, "out": 15.00, "cache_read": 0.30, "cache_write": 3.75},
    "claude-haiku-4-5":   {"in": 1.00, "out": 5.00,  "cache_read": 0.10, "cache_write": 1.25},
    "gpt-5.4-mini":       {"in": 0.75, "out": 4.50,  "cache_read": 0.19, "cache_write": 0.75},
}

@dataclass
class CostTracker:
    conversation_id: str
    calls: list = field(default_factory=list)

    def record(self, model: str, usage) -> float:
        p = PRICES[model]
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
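        # Anthropic's input_tokens excludes cached tokens. OpenAI-style usage
        # objects count cached tokens inside the input total and use different
        # field names, so adapt them here or cached tokens get billed twice.
        fresh_in = usage.input_tokens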
        cost = (
            fresh_in * p["in"]
            + cache_write * p["cache_write"]
            + cache_read * p["cache_read"]
            + usage.output_tokens * p["out"]
        ) / 1_000_000
        self.calls.append({
            "model": model, "in": fresh_in, "out": usage.output_tokens,
            "cache_read": cache_read, "cost_usd": round(cost, 6),
            "ts": time.time(),
        })
        return cost

    def summary(self) -> dict:
        return {
            "conversation_id": self.conversation_id,
            "llm_calls": len(self.calls),
            "total_cost_usd": round(sum(c["cost_usd"] for c in self.calls), 4),
            "total_input": sum(c["in"] for c in self.calls),
            "total_output": sum(c["out"] for c in self.calls),
            "cache_hit_tokens": sum(c["cache_read"] for c in self.calls),
        }

# In your agent loop, after every provider call:
# cost = tracker.record("claude-sonnet-4-6", response.usage)
# logging.info(json.dumps(tracker.summary()))

Ship these summaries to whatever you already use - Postgres, ClickHouse, or plain structured logs. Then answer five questions before touching any code:

1. What is my p50 and p95 cost per conversation? The p95 usually matters more - a small tail of runaway conversations often accounts for 30-50% of spend.

2. How many LLM calls does an average conversation make? More than 8-10 usually signals loops or tool retries.

3. What fraction of input tokens are repeated prefix? This predicts your prompt caching savings directly.

4. What share of conversations actually needed the expensive model? This predicts routing savings.

5. Which tools return the largest results? These are your trimming targets.
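The first question can be answered straight from the logged summaries. The sketch below assumes each summary is a dict with a total_cost_usd key, as produced by the tracker's summary() method; cost_percentiles and tail_share are illustrative helper names, not part of any library.

```python
from statistics import quantiles

def cost_percentiles(summaries):
    """Return (p50, p95) of total_cost_usd across conversation summaries."""
    costs = sorted(s["total_cost_usd"] for s in summaries)
    if len(costs) < 2:
        raise ValueError("need at least two conversations")
    cuts = quantiles(costs, n=100, method="inclusive")  # 99 cut points
    return cuts[49], cuts[94]

def tail_share(summaries, top_fraction=0.05):
    """Share of total spend that comes from the most expensive conversations."""
    costs = sorted((s["total_cost_usd"] for s in summaries), reverse=True)
    k = max(1, round(len(costs) * top_fraction))
    return sum(costs[:k]) / sum(costs)
```

If tail_share comes back high, bounding the runaway conversations is likely worth more than shaving the average.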

If you built your agent with LangGraph, the callback system gives you per-node token usage for free - see our LangGraph tutorial for the callback setup. On the OpenAI side, the Agents SDK exposes usage on every run result - covered in our OpenAI Agents SDK tutorial.

The Seven Cost Levers, Ranked by ROI

Here is the master table. Savings percentages are typical ranges for agent workloads specifically - chatbots and one-shot pipelines see less benefit from the top rows because they repeat less context.

#	Lever	Typical savings	Effort	Quality risk	Best for
1	Prompt caching	50-75% of input cost	Hours	None	Any agent with a stable system prompt and tools
2	Model routing	40-70% overall	Days	Low-medium	Mixed-difficulty traffic (support, ops, triage)
3	Context trimming and summarization	20-50% of input cost	Days	Low	Long conversations, verbose tools
4	max_tokens caps and structured outputs	10-30%	Hours	None	Retry-prone extraction and JSON tasks
5	Batch API	Flat 50% on eligible traffic	Hours	None	Evals, enrichment, nightly jobs, digests
6	Semantic response caching	5-30%	Days	Medium	High-repetition query patterns (FAQ-heavy)
7	Cheaper provider or self-hosting	Varies widely	Weeks	Medium	High steady volume - see the self-host section below

Two things about this ranking. First, the levers stack multiplicatively. Caching that cuts input cost 70%, routing that halves the average model price, and trimming that removes 30% of tokens leave roughly 0.30 x 0.50 x 0.70 = 0.105 of the original bill - close to a 10x reduction, not a 150% one. Second, caching, token caps, and structured outputs carry essentially zero quality risk when implemented correctly - they change what you pay for tokens, not which tokens the model reasons over. Routing and trimming do change which model runs or what it sees, so gate them with evals; done right, trimming often improves quality by removing noise.
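The multiplicative stacking is easy to check. This sketch uses the same simplification as the paragraph above, treating each lever as a cut to the whole bill (caching really only discounts input tokens, so the true figure depends on your input/output mix); remaining_fraction is an illustrative name.

```python
from math import prod

def remaining_fraction(reductions):
    """Fraction of the original cost left after applying each cut in turn."""
    return prod(1.0 - r for r in reductions)

levers = {"caching": 0.70, "routing": 0.50, "trimming": 0.30}
left = remaining_fraction(levers.values())
print(f"remaining cost: {left:.3f}x, reduction factor: {1 / left:.1f}x")
```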

Work the list top to bottom. Most teams stop after lever 4 because the remaining spend no longer justifies engineering time. The sections below implement each lever.

Lever 1: Prompt Caching (90% Off Repeated Input)

Prompt caching is the single most effective lever for agents, because agents are pathological prefix-repeaters. Call 6 in our worked example re-sent everything from calls 1 through 5. The provider already processed those tokens - caching lets you pay a fraction to reuse the computed KV cache instead of paying full price to reprocess them.

Anthropic pricing mechanics (as of mid-2026): cache writes cost 1.25x base input for the default 5-minute TTL, or 2x for the extended 1-hour TTL. Cache reads cost 0.1x base input - a 90% discount. On Sonnet-class pricing of $3 per million input tokens, cached tokens cost $0.30 per million. The cache refreshes its TTL on every hit, so an active conversation keeps its cache alive as long as calls keep arriving within the TTL. Break-even math: at 1.25x write plus 0.1x read, two calls cost 1.35x instead of 2x uncached, so the cache pays for itself on the very first reuse.

You opt in explicitly with cache_control breakpoints. Place them at the boundaries of stable content - after tool definitions, after the system prompt, and after the conversation history:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[
        # ... 11 tool definitions ...
        {
            "name": "search_orders",
            "description": "Search customer orders by email or order ID.",
            "input_schema": {...},
            # Breakpoint 1: tools change on deploys only
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # instructions, policies, examples
            # Breakpoint 2: stable per deployment
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        *conversation_history[:-1],
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": latest_user_message,
                # Breakpoint 3: caches the whole history prefix,
                # so the next agent-loop iteration reads it at 0.1x
                "cache_control": {"type": "ephemeral"},
            }],
        },
    ],
)
print(response.usage.cache_read_input_tokens)   # 0 on a cold cache, large from the second call on
print(response.usage.cache_creation_input_tokens)

Three rules that trip people up. Rule one: the prefix must be byte-identical. A timestamp in your system prompt, a randomly ordered tool list, or a per-request user ID at the top of the prompt breaks the cache for everything after it. Put all dynamic content at the end of the prompt. Rule two: mind the minimum. Anthropic requires 1,024+ tokens per cacheable block on most models (2,048 on Haiku-class) - do not add breakpoints around tiny fragments. Rule three: verify with usage fields. If cache_read_input_tokens is zero on the second call of a conversation, your prefix is not stable - diff two consecutive requests and find the mutation.

OpenAI caching is automatic: prompts over 1,024 tokens get prefix caching without any code changes, with cached tokens discounted 75-90% depending on the model family. You still need the same discipline - stable prefix, dynamic suffix - to actually hit the cache. Gemini offers both implicit caching and explicit context caching, with cache reads at roughly 10% of the base input price plus an hourly storage fee for explicit caches.

Expected result on a real agent: if 85% of your input tokens are repeated prefix (typical), caching at 90% off turns your input bill into 0.15 + 0.85 x 0.1 = 0.235x of the original - a 76% reduction in input cost for an afternoon of work.
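The same arithmetic as a small calculator, so you can plug in your own repeated-prefix share. This is a sketch using the multipliers quoted above (1.25x write, 0.1x read); the function names are illustrative, and cached_input_factor ignores the write premium just as the estimate above does.

```python
def cached_input_factor(repeated_share, read_mult=0.10):
    """Input cost relative to uncached, ignoring the cache-write premium."""
    return (1.0 - repeated_share) + repeated_share * read_mult

def prefix_cost(reuses, write_mult=1.25, read_mult=0.10):
    """Cost of one prefix written once and read `reuses` times.

    Returns (cached, uncached) in units of the prefix's uncached price.
    """
    cached = write_mult + reuses * read_mult
    uncached = 1.0 + reuses
    return cached, uncached

print(f"{cached_input_factor(0.85):.3f}x of the original input bill")
for reuses in range(0, 4):
    cached, uncached = prefix_cost(reuses)
    print(f"{reuses} reuses: cached {cached:.2f}x vs uncached {uncached:.2f}x")
```

Two things fall out of prefix_cost: a prefix that is never reused costs more with caching than without, and with write_mult=2.0 (the 1-hour TTL) the break-even moves from the first reuse to the second.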

Lever 2: Model Routing (Cheap Model First, Escalate on Uncertainty)

Frontier models are priced 4-25x above their small siblings, and most production agent traffic does not need frontier reasoning. Compare the tiers (per million tokens, mid-2026 list prices - always verify current rates):

Tier	Model examples	Input / Output per M	Use for
Frontier	Claude Opus 4.x	$5.00 / $25.00	Multi-step planning, hard reasoning, high-stakes output
Workhorse	Claude Sonnet 4.6, GPT-5.x mid tier	$3.00 / $15.00	Default agent driver, complex tool use
Fast	Claude Haiku 4.5, GPT-5.4 Mini	$0.75-1.00 / $4.50-5.00	Routine tool loops, classification, summarization
Nano	GPT-5.4 Nano, Gemini 2.5 Flash-class	$0.20-0.30 / $1.25-2.50	Routing decisions, extraction, guardrails

The routing pattern that works in production is cheap-first with escalation, not a fancy learned router. A nano-tier classifier (or a heuristic) decides the difficulty tier, the cheap model attempts the task, and you escalate when confidence signals fire:

CHEAP = "claude-haiku-4-5"
STRONG = "claude-sonnet-4-6"

ESCALATION_SIGNALS = [
    lambda r: r.stop_reason == "max_tokens",          # ran out, likely rambling
    lambda r: "i'm not sure" in r.text.lower(),
    lambda r: r.tool_error_count >= 2,                # repeated tool-call failures
    lambda r: r.self_reported_confidence < 0.6,       # ask the model to rate itself
]

def run_with_routing(task, context):
    # Stage 1: route obviously-hard tasks straight to the strong model
    if task.category in {"refund_dispute", "multi_account", "legal"}:
        return run_agent(STRONG, task, context)

    # Stage 2: cheap model attempts the task
    result = run_agent(CHEAP, task, context)

    # Stage 3: escalate on uncertainty; the strong model reruns the task from the same context
    if any(sig(result) for sig in ESCALATION_SIGNALS):
        log_escalation(task, result)
        return run_agent(STRONG, task, context)

    return result

The economics: if 70% of traffic resolves on the fast tier at roughly 1/4 the price, and 30% escalates (paying the cheap attempt as overhead), your blended cost is about 0.7 x 0.25 + 0.3 x 1.25 = 0.55x - a 45% cut. The overhead of failed cheap attempts is real but small, because the cheap attempt itself costs a quarter of a strong-model attempt.
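To see how sensitive that number is to the cheap tier's success rate, here is the blended-cost formula as a function. It is a sketch under the same assumptions as the paragraph above: an escalated task pays for the failed cheap attempt plus a full strong-model run, and both attempts use similar token counts.

```python
def blended_cost(cheap_success_rate, cheap_price_ratio):
    """Blended cost relative to sending every task to the strong model.

    cheap_price_ratio is cheap-model price divided by strong-model price.
    """
    resolved_cheap = cheap_success_rate * cheap_price_ratio
    escalated = (1.0 - cheap_success_rate) * (1.0 + cheap_price_ratio)
    return resolved_cheap + escalated

for rate in (0.9, 0.7, 0.5, 0.3):
    print(f"{rate:.0%} resolved on cheap tier -> {blended_cost(rate, 0.25):.2f}x")
```

The expression simplifies to 1 + cheap_price_ratio - cheap_success_rate, so every point of cheap-tier success is worth one point of blended cost.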

Two implementation notes. First, the self-reported confidence signal is crude but surprisingly effective when combined with the structural signals (max_tokens hits, tool errors) - do not rely on it alone. Second, log every escalation and review weekly: if a task category escalates more than half the time, route it straight to the strong model and stop paying the failed-attempt tax. For guidance on which provider's cheap tier holds up best for business agent workloads, see our comparison of ChatGPT vs Claude for business agents, and if you are considering open-weight models as a routing tier, our roundup of the best open-source agent models covers what actually handles tool calls reliably.

Lever 3: Context Trimming and Summarization Memory

Every token in your context is billed on every subsequent LLM call in the loop. A 5,000-token tool result in step 2 of an 8-step conversation gets billed roughly 7 times. Context hygiene is therefore not a one-time saving - it compounds through the whole conversation. Three techniques, in order of impact:

1. Trim tool outputs at the source. The most common agent cost bug is dumping raw API responses into the context. A CRM lookup returns 6,000 tokens of JSON when the agent needs 12 fields. Truncate, project fields, and cap every tool's output:

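import json

MAX_TOOL_TOKENS = 800

def clean_tool_result(raw: dict, fields: list[str]) -> str:
    # Keep only the fields the agent actually needs.
    projected = {k: raw[k] for k in fields if k in raw}
    text = json.dumps(projected, ensure_ascii=False)
    # estimate_tokens and truncate_to_tokens are helpers you supply
    # (tiktoken or the provider's count_tokens API).
    tokens = estimate_tokens(text)
    if tokens > MAX_TOOL_TOKENS:
        # The truncated text is no longer valid JSON; the marker tells the model so.
        text = truncate_to_tokens(text, MAX_TOOL_TOKENS) + '... [truncated]'
    return text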

2. Summarize old turns instead of carrying them verbatim. Once a conversation passes a threshold, compress everything except the last few turns into a short summary. Use a nano-tier model for the summarization itself so the compression is nearly free:

SUMMARIZE_AFTER_TOKENS = 12_000
KEEP_RECENT_TURNS = 6

def compact_history(messages: list, client) -> list:
    if total_tokens(messages) < SUMMARIZE_AFTER_TOKENS:
        return messages

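    # Cut on a turn boundary: a tool_result must stay with the tool_use that
    # precedes it, so shift the split if it would separate such a pair.
    old, recent = messages[:-KEEP_RECENT_TURNS], messages[-KEEP_RECENT_TURNS:]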
    summary = client.messages.create(
        model="claude-haiku-4-5",   # cheap model does the compression
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation for an agent's memory. "
                       "Keep: user goals, decisions made, IDs and values "
                       "referenced, unresolved items. Max 300 words.\n\n"
                       + render(old),
        }],
    ).content[0].text

    return [
        {"role": "user", "content": f"[Conversation summary]\n{summary}"},
        {"role": "assistant", "content": "Understood, continuing."},
        *recent,
    ]

3. Drop stale tool results entirely. After the agent has acted on a tool result, later loop iterations rarely need the full payload - replace it with a one-line stub like [search_orders returned 3 results, order #4821 selected]. LangGraph makes this clean via a message-pruning node before the LLM node.
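A minimal sketch of the third technique, assuming Anthropic-style messages where tool results are tool_result blocks inside a list-valued content field. stub_stale_tool_results is an illustrative helper, and the generic stub text is a placeholder; a per-tool one-line summary like the example above preserves more signal.

```python
def stub_stale_tool_results(messages, keep_last=1,
                            stub="[tool result omitted after use]"):
    """Return a copy of messages with all but the newest tool results stubbed."""
    # Locate every tool_result block as (message index, block index).
    positions = [
        (mi, bi)
        for mi, msg in enumerate(messages)
        if isinstance(msg.get("content"), list)
        for bi, block in enumerate(msg["content"])
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    stale = set(positions[:-keep_last]) if keep_last > 0 else set(positions)

    pruned = []
    for mi, msg in enumerate(messages):
        content = msg.get("content")
        if isinstance(content, list):
            # Replace only the payload; tool_use_id stays so the pairing is intact.
            content = [
                {**block, "content": stub} if (mi, bi) in stale else block
                for bi, block in enumerate(content)
            ]
        pruned.append({**msg, "content": content})
    return pruned
```

The function returns new message dicts rather than mutating the input, so the full transcript is still available for logging.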

One caution: summarization interacts with prompt caching. Rewriting history invalidates the cached prefix, so compact at natural boundaries (start of a new user turn, not mid-loop) and make the threshold generous enough that compaction is rare. The same tension applies to RAG-heavy agents, where retrieved chunks bloat context fast - our guide to building a RAG agent covers retrieval budgets that keep chunk counts sane.

Lever 4: Structured Outputs and max_tokens Caps

Two small changes that consistently pay for themselves the same day.

Kill retries with structured outputs. Every malformed JSON response is a full-price retry - you pay the entire input context again for a formatting failure. If your agent extracts data or returns machine-readable output, use the provider's structured output mode (OpenAI response_format with a JSON schema, Anthropic tool-use with a single forced tool) so the model fills in your schema instead of writing free-form text. Teams commonly find 5-15% of their extraction spend was silently going to parse-and-retry loops. The fix is a few lines:

# Anthropic: force a single tool call so the output arrives as structured tool input
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=400,
    tools=[{"name": "record_ticket", "input_schema": TICKET_SCHEMA,
            "description": "Record the classified ticket."}],
    tool_choice={"type": "tool", "name": "record_ticket"},
    messages=[{"role": "user", "content": ticket_text}],
)
data = response.content[0].input  # dict shaped by TICKET_SCHEMA; validate it before trusting it

Cap max_tokens per call type. Output tokens cost roughly 5-6x input tokens at the list prices above, and an uncapped agent occasionally produces a 3,000-token essay where a 200-token answer was needed. Set differentiated caps: 300-500 for intermediate reasoning steps in the loop, 150 for classifications, 1,000-1,500 for final user-facing answers. Then alert on stop_reason == "max_tokens" - a spike means either the cap is too tight or, more often, the model is looping and you just saved yourself from paying for the whole spiral. Combined with a hard ceiling on loop iterations (10-15 for most agents), this bounds your worst-case cost per conversation, which matters more for your p95 than any average-case optimization.
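A sketch of both bounds together. The cap values take the upper end of the ranges above, and step is a hypothetical callable standing in for one LLM call of your agent loop that returns (done, stop_reason); none of these names come from a provider SDK.

```python
MAX_TOKENS_BY_CALL_TYPE = {
    "classification": 150,
    "intermediate": 500,
    "final_answer": 1_500,
}
MAX_LOOP_ITERATIONS = 12

def max_tokens_for(call_type):
    """Look up the output cap to pass as max_tokens for this kind of call."""
    return MAX_TOKENS_BY_CALL_TYPE[call_type]

def run_bounded_loop(step, max_iterations=MAX_LOOP_ITERATIONS):
    """Run step(iteration) until it reports done or the iteration ceiling is hit."""
    truncated = 0
    for iteration in range(1, max_iterations + 1):
        done, stop_reason = step(iteration)
        if stop_reason == "max_tokens":
            truncated += 1  # feed this count into the max_tokens alert
        if done:
            return {"status": "done", "iterations": iteration, "truncated": truncated}
    return {"status": "ceiling_hit", "iterations": max_iterations, "truncated": truncated}
```

A ceiling_hit status is the point to return a handoff message instead of letting the loop continue.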

Lever 5: Batch APIs - a Flat 50% for Anything Async

Both Anthropic and OpenAI offer batch endpoints: submit up to tens of thousands of requests, get results back within 24 hours (usually much faster, often under an hour), and pay a flat 50% discount on both input and output tokens. On Anthropic, the batch discount stacks with prompt caching, so a cached batch request can cost as little as 5% of the standard rate on its cached portion.

Agents feel real-time, so teams assume batch does not apply to them. Look again at everything around your agent:

Workload	Real-time needed?	Batch-eligible
Live agent conversations	Yes	No
Nightly conversation summaries for CRM	No	Yes
Eval suites and regression tests	No	Yes
Document / knowledge-base enrichment	No	Yes
Ticket backlog classification	No	Yes
Weekly analytics digests	No	Yes

In many agent products, 20-40% of total token spend turns out to be these offline jobs, quietly running at full price on the real-time endpoint because that was the code path that already existed. Moving them is mostly plumbing:

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"summary-{conv.id}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 400,
                "messages": [{"role": "user",
                              "content": SUMMARY_PROMPT + conv.transcript}],
            },
        }
        for conv in yesterdays_conversations
    ],
)
# Poll batch.processing_status, then stream results by custom_id

Rule of thumb: anything with a latency tolerance above one hour goes through batch, full stop. Your evals especially - teams that run serious eval suites on every deploy often find evals rival production traffic in token volume.
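The polling step mentioned in the batch snippet can be sketched as follows. It assumes the Anthropic Python SDK's batches.retrieve and batches.results methods and the "ended" / "succeeded" status values as they exist at the time of writing; verify the names against the current SDK before relying on this. wait_for_batch is an illustrative helper.

```python
import time

def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll until the batch ends, then return {custom_id: text} for succeeded requests."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        sleep(poll_seconds)

    texts = {}
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            texts[entry.custom_id] = entry.result.message.content[0].text
        # Errored, canceled, or expired entries are skipped here; log and retry them in real code.
    return texts
```

Because results are keyed by custom_id rather than returned in submission order, the custom_id is what ties each summary back to its conversation.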

Lever 6: Semantic Caching of Responses

Prompt caching discounts repeated input; semantic caching skips the LLM call entirely when a semantically equivalent request was already answered. Embed the incoming query, search a vector store of previous query-response pairs, and return the stored response above a similarity threshold:

CACHE_THRESHOLD = 0.95  # start strict, loosen with monitoring

def semantic_cache_lookup(query: str, user_context_hash: str):
    emb = embed(query)  # small embedding model, fractions of a cent
    hit = vector_store.search(
        emb, top_k=1,
        filter={"context_hash": user_context_hash},  # never cross users/state
    )
    if hit and hit.score >= CACHE_THRESHOLD and not hit.is_expired():
        return hit.response
    return None

Be honest about where this applies. It works for stateless, high-repetition queries: product questions, policy lookups, "how do I reset my password" traffic. It is dangerous for anything personalized or stateful - "where is my order" must never return another customer's cached answer, which is why the context_hash filter above is non-negotiable. Practical guardrails: cache only first-turn queries (not mid-conversation), set TTLs matching how fast the underlying facts change (hours for pricing, days for how-to content), and log cache serves so you can audit for wrong answers.

Expected impact is workload-dependent: FAQ-heavy support agents see 20-30% of first-turn queries served from cache; low-repetition internal tools see close to zero. Measure your query redundancy before building this - it is the only lever on the list where "skip it" is frequently the right call.
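One cheap way to measure redundancy before building anything is to count how many logged first-turn queries repeat after light normalization. This is a rough sketch, not a semantic measure: exact-match counting gives a lower bound, since an embedding-based cache would also catch paraphrases. The helper names are illustrative.

```python
import re
from collections import Counter

def normalize(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())

def repeat_rate(first_turn_queries):
    """Fraction of queries whose normalized text duplicates another query in the list."""
    counts = Counter(normalize(q) for q in first_turn_queries)
    total = sum(counts.values())
    repeats = total - len(counts)
    return repeats / total if total else 0.0
```

If the repeat rate on a week of logs is near zero, that is a strong hint to skip this lever.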

Worked Example: Support Agent From $0.42 to $0.06 per Conversation

Here is an illustrative worked example (representative numbers, not a specific client engagement) showing the levers stacking on a typical e-commerce support agent: Sonnet-class model, 3k-token system prompt and tools, average 7 LLM calls and 5 tool calls per conversation, 15,000 conversations per month.

Step	Change	Cost / conversation	Monthly (15k convs)
Baseline	Sonnet-class everywhere, no caching, raw tool outputs	$0.42	$6,300
+ Prompt caching	cache_control on tools, system prompt, history	$0.16	$2,400
+ Context trimming	Tool outputs capped at 800 tokens, history compaction	$0.11	$1,650
+ Model routing	Haiku-class for 72% of conversations, escalation on signals	$0.065	$975
+ Token caps and structured outputs	Per-step max_tokens, forced tool-use for extraction	$0.058	$870
Result	All levers combined	$0.06	~$900 (86% reduction)

Quality held because none of these levers degrade the tokens the model reasons over for hard cases: caching is a pure pricing mechanism, trimming removed noise the model was ignoring anyway, and routing kept the strong model for the 28% of conversations that needed it (verified with a 500-conversation eval set scored before and after each step - never ship a routing change without an eval gate).

Note the shape of the curve: the first lever took out 62% of the cost in one afternoon; the last lever took out 11% of what remained. This is typical, and it is why the ranked ordering matters. For more end-to-end economics of agent projects - including the revenue side, not just cost - see our breakdown of real AI agent ROI examples.
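The percentages quoted here follow directly from the per-conversation costs in the table. A short check, using only the table's numbers:

```python
CONVERSATIONS_PER_MONTH = 15_000
STEPS = [
    ("Baseline", 0.42),
    ("+ Prompt caching", 0.16),
    ("+ Context trimming", 0.11),
    ("+ Model routing", 0.065),
    ("+ Token caps and structured outputs", 0.058),
]

def step_cuts(steps):
    """Fractional cut each step makes relative to the step before it."""
    return [
        (name, 1.0 - cost / prev_cost)
        for (_, prev_cost), (name, cost) in zip(steps, steps[1:])
    ]

for name, cost in STEPS:
    print(f"{name:<38} ${cost:.3f}/conv  ${cost * CONVERSATIONS_PER_MONTH:,.0f}/month")
for name, cut in step_cuts(STEPS):
    print(f"{name:<38} removes {cut:.0%} of what remained")

total_cut = 1.0 - STEPS[-1][1] / STEPS[0][1]
print(f"overall reduction: {total_cut:.0%}")
```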

When to Consider Self-Hosting Instead

Everything above optimizes the API-side bill. At some volume, a different question appears: should you stop paying per token at all and run open-weight models on your own GPUs? The short answer is that self-hosting starts to make sense when your post-optimization API spend sits consistently above roughly $2,000-3,000 per month on workloads a good open model can handle, and you have (or want) the ops capability to run inference infrastructure.

The key phrase is post-optimization. A common and expensive mistake is comparing a raw, unoptimized API bill against self-hosting costs. Apply the levers in this post first - if caching and routing take you from $6,000 to $900 a month, the self-hosting business case that looked obvious just evaporated, along with the on-call burden it would have brought. The full decision framework - break-even math, hidden ops costs, quality gaps, hybrid setups - is covered in our dedicated guide to self-hosting vs the OpenAI API, so we will not re-litigate it here.

If you do cross that threshold, the practical path is well-trodden: our walkthrough on deploying LangGraph with vLLM in production gets a self-hosted agent stack running in about 30 minutes, and the broader self-hosted LLM agent stack guide covers the surrounding components. Many teams land on a hybrid: self-hosted small model for the high-volume cheap tier, API frontier model for escalations - which is just lever 2, model routing, with one route pointing at your own hardware.

Cost Monitoring and Alerts That Catch Regressions

Optimization without monitoring decays. A deploy that adds a timestamp to the system prompt silently kills your cache hit rate, and you find out on the invoice three weeks later. Wire these five signals into whatever alerting you already run:

Metric	Alert condition	What it catches
Cache hit ratio (cache_read / total input)	Drops below 60% for 1 hour	Prefix mutations breaking the cache
Cost per conversation, p50 and p95	p50 up 25% day-over-day	Prompt bloat, new verbose tools, model config drift
LLM calls per conversation	p95 above 15	Agent loops, tool retry storms
Escalation rate (routing)	Above 45% sustained	Cheap-tier degradation, routing threshold drift
Daily spend vs budget	Projected month-end above budget	Everything else, including abuse and traffic spikes

Two operational habits complete the picture. First, set hard provider-side spend limits in the OpenAI and Anthropic consoles as a backstop - an agent bug that loops on a tool error can burn hundreds of dollars an hour, and application-level circuit breakers occasionally fail. Second, add a per-conversation cost ceiling in the agent itself: if the running tracker from the measurement section crosses, say, $0.50, stop the loop, return a graceful handoff message, and page someone. Runaway conversations should be a bounded incident, not an open-ended one.
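Two of these checks are small enough to sketch inline: the per-conversation ceiling and the cache hit ratio from the alert table. The code assumes call records shaped like the tracker's entries (keys in and cache_read, with an optional cache_write you would need to add to the log); the exception and function names are illustrative.

```python
class CostCeilingExceeded(RuntimeError):
    """Raised when one conversation crosses its spend ceiling."""

def enforce_cost_ceiling(running_cost_usd, ceiling_usd=0.50):
    """Call after every LLM call; raises once the running cost passes the ceiling."""
    if running_cost_usd > ceiling_usd:
        raise CostCeilingExceeded(
            f"conversation cost ${running_cost_usd:.2f} exceeds ${ceiling_usd:.2f}"
        )

def cache_hit_ratio(calls):
    """cache_read tokens divided by all input tokens across the given calls."""
    cache_read = sum(c.get("cache_read", 0) for c in calls)
    total_input = sum(
        c.get("in", 0) + c.get("cache_read", 0) + c.get("cache_write", 0)
        for c in calls
    )
    return cache_read / total_input if total_input else 0.0
```

Catch CostCeilingExceeded at the top of the agent loop, return the handoff message there, and emit the page from the same handler so the two can never drift apart.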

Tag every request with an agent name and feature via metadata fields so the invoice decomposes cleanly. When finance asks "why did June spike," the answer should take one query, not one week.

Six Mistakes That Undo Your Savings

1. Dynamic content at the top of the prompt. A date, a user name, or a session ID in the first line of the system prompt invalidates the cache for every token after it. Everything stable first, everything dynamic last. This one mistake is behind the majority of "caching doesn't work for us" complaints.

2. Routing by request length instead of difficulty. Short requests are not easy requests ("cancel my account and dispute the last three charges") and long ones are not hard (a pasted email that needs a two-line summary). Route on task category and escalation signals, not token counts.

3. Over-trimming context until quality drops, then blaming the model. Trimming is safe when you remove redundancy; it is destructive when you remove the facts the model needed. Always A/B trimming changes against an eval set, and keep the summarization prompt explicit about preserving IDs, amounts, and decisions.

4. Optimizing the model bill while ignoring retries and loops. If 10% of conversations retry the whole chain twice due to flaky tools or parse failures, you have a 20% cost tax that no pricing lever fixes. Fix the tool reliability and use structured outputs first.

5. Comparing raw API costs against self-hosting. Covered above, but it bears repeating: the fair comparison is optimized API cost versus fully loaded self-hosting cost including engineering time and on-call. Run both sides honestly - our self-host vs API guide has the worksheet.

6. Optimizing before measuring. Teams regularly spend a week on semantic caching (lever 6) while their cache hit rate sits at zero (lever 1) because nobody looked at the usage fields. One week of per-request telemetry ranks the levers for your workload and prevents this entirely.

Next Steps

The playbook in one paragraph: instrument per-request cost first, then apply the levers in ROI order - prompt caching (hours, up to 76% off input), model routing (days, roughly half the blended rate), context trimming (compounds through every loop iteration), structured outputs and token caps (bounds your worst case), batch for everything async (flat 50%), semantic caching only if your traffic repeats. Re-check the self-hosting question only after the API bill is optimized. Most teams cut 70-90% of agent spend with the first four levers and never need the rest.

If you want to go deeper, our Production-Grade Agent Engineering course dedicates a full module to cost engineering, evals, and the monitoring setup from this post, built around real agent codebases. And if you would rather have this done for you - a cost audit of your existing agents, the levers implemented, and monitoring wired up - work with us directly or check our engagement pricing. A one-week cost audit typically pays for itself within the first month of savings.

FAQ

How much does prompt caching actually save for an AI agent?

For typical agents, 80-95% of each request's input tokens are repeated prefix (system prompt, tool definitions, conversation history). With Anthropic cache reads at 0.1x the base input price and OpenAI's automatic caching discounting cached tokens 75-90%, the input bill usually drops 60-76%. Since agents are input-dominated (often 20-40x more input than output tokens), that translates to a 50-70% cut in total spend from this one lever.

Why is my Anthropic cache_read_input_tokens always zero?

The cached prefix must be byte-identical between requests. The usual culprits are a timestamp or user-specific value early in the system prompt, tool definitions serialized in a non-deterministic order, or cacheable blocks under the minimum size (1,024 tokens on most Claude models, 2,048 on Haiku-class). Diff two consecutive raw request payloads and find the first byte that changes - everything after it misses the cache.

What is a reasonable LLM cost per request for a production agent?

It varies enormously by task, but useful reference points: a well-optimized support or ops agent lands at $0.03-0.10 per conversation, an unoptimized one at $0.30-0.60, and complex research or coding agents at $0.50-5.00 per task even after optimization. The more actionable numbers are your own p50 and p95 per conversation tracked over time, plus cost as a percentage of the value each conversation creates.

Does model routing hurt answer quality?

Not if you gate it with evals. Build a set of 200-500 representative conversations with scored outcomes, measure the strong model's baseline, then measure the routed system. Cheap-first with escalation typically keeps 60-80% of traffic on the fast tier while the eval score stays within noise, because escalation signals (max_tokens hits, tool errors, low self-rated confidence) catch most cases the small model fumbles. Ship routing behind the eval gate and review escalation logs weekly.

Should I use the batch API for my agent's live traffic?

No - batch endpoints return results asynchronously (up to 24 hours, usually faster), which does not work for interactive conversations. Use batch for everything around the agent instead: eval suites, nightly conversation summaries, document enrichment, backlog classification, analytics digests. These offline jobs are often 20-40% of total token spend and the flat 50% discount applies to all of it, stacking with prompt caching on Anthropic.

Is semantic caching safe for a support agent?

Only with strict guardrails. Cache exclusively stateless, non-personalized queries (policy questions, how-to content), partition the cache by user or context hash so one customer can never receive another's answer, use a high similarity threshold (0.95+) to start, set TTLs matched to how fast the underlying facts change, and log every cache serve for audit. If your traffic is mostly account-specific, skip this lever - the risk outweighs the single-digit savings.

At what monthly spend should I consider self-hosting instead of APIs?

As a rough threshold, when your post-optimization API bill consistently exceeds $2,000-3,000 per month on workloads an open-weight model handles well, and you have the ops capacity to run GPU inference. The critical word is post-optimization: apply caching, routing, and trimming first, because they frequently shrink the bill below the point where self-hosting pays. Our self-host vs OpenAI API cost guide walks through the full break-even math including hidden operational costs.

How do I stop a buggy agent from burning hundreds of dollars in a loop?

Layer three defenses. In the agent: a hard cap on loop iterations (10-15) and a per-conversation cost ceiling that halts the run and returns a graceful handoff. In your monitoring: an alert when LLM calls per conversation p95 exceeds normal or daily spend projects over budget. At the provider: hard spend limits configured in the OpenAI and Anthropic consoles as the final backstop. Each layer catches failures the previous one misses.
