So Daniel sent us this one, and it's a meaty one. He's asking about agent cost optimization and monitoring for production deployments. The full breakdown, part one and part two. Part one covers model routing, prompt caching, token budgets, and response caching. Part two gets into per-step cost tracking, monitoring tools, alert patterns, and cost attribution across multi-agent systems. The through-line is eliminating surprise bills and getting complete visibility into what your agents are actually spending. And Daniel being Daniel, he wants the full engineering playbook, not the hand-wavy version.
Good timing on this topic. By the way, today's episode is being written by Claude Sonnet four point six, which adds a certain irony to a discussion about keeping LLM costs under control.
The AI writing the script about not spending too much on AI. Totally normal.
We live in interesting times. So let's start with the problem statement, because I think a lot of people underestimate how fundamentally different LLM billing is from any other infrastructure cost they've managed before. With a database or a server, you pay for uptime. With LLMs, you pay for every token processed, at rates that span three orders of magnitude depending on which model you're calling. A single misconfigured agent can burn through five figures in forty-eight hours.
That's not hypothetical, right? The forty-seven thousand dollar incident.
Not hypothetical at all. It's documented from the OpenClaw Discord. Someone's agent got stuck in a loop, forty-seven thousand dollar bill in forty-eight hours, Anthropic wouldn't refund, they found out when their card declined. Another developer burned four hundred dollars over a weekend. A third lost six hundred when a bug sent their agent into a retry loop on GPT-4 at thirty dollars per million tokens. And Galileo's research from last year found that forty percent of agentic AI projects fail primarily because of hidden costs.
Forty percent. That's not a rounding error, that's a leading cause of death for these projects.
Which is why I want to frame this episode as a practical guide for production environments. Because there's a lot of content out there about how to build agents. There's not nearly enough about how to stop them from quietly bankrupting you. Let's start with model routing, because it's where the biggest lever is.
Before you dive in, give me the scale of the problem. What does it actually cost to run an agent at any meaningful volume?
Take a production support agent hitting Claude Sonnet four at three dollars per million input tokens, with a three thousand token system prompt, four thousand tokens of retrieved documents per query, and growing conversation history. By turn eight of a conversation you're at twelve thousand tokens per call. Scale that to a million API calls a month and the system prompt alone costs nine thousand dollars monthly. Add retrieval and history and you're around thirty-six thousand dollars before you even count output tokens, every month, for one agent.
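A quick sanity check of that arithmetic, as a sketch. The rate, token counts, and call volume are the assumptions from the example above:

```python
# Back-of-the-envelope input-token cost model for the support-agent example.
# Assumptions from the discussion: $3 per million input tokens, a 3,000-token
# system prompt, 12,000 tokens per call by turn eight, a million calls a month.

def monthly_input_cost(tokens_per_call: int, calls_per_month: int,
                       usd_per_million_tokens: float) -> float:
    """Input-token spend for a month, ignoring output tokens."""
    total_tokens = tokens_per_call * calls_per_month
    return total_tokens / 1_000_000 * usd_per_million_tokens

CALLS = 1_000_000
RATE = 3.00  # USD per million input tokens

system_prompt_only = monthly_input_cost(3_000, CALLS, RATE)
full_context = monthly_input_cost(12_000, CALLS, RATE)

print(f"system prompt alone: ${system_prompt_only:,.0f}/month")
print(f"full 12k-token calls: ${full_context:,.0f}/month before output tokens")
```

Output tokens, typically billed at several times the input rate, come on top of this.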
So the first optimization you want to talk about is model routing.
The era of the single-model agent is over. The insight is straightforward: inside a single agent loop, not every step requires the same capability. A turn might involve a quick intent classification, then a tool selection decision, then a multi-hop reasoning chain, then a final response synthesis. If you route all four of those through Claude Opus four at fifteen dollars per million input tokens, you're correct but wildly wasteful. If you route all four through Haiku at thirty cents per million, you're cheap but the reasoning step will fail.
So you route each step independently.
And the cost-capability spread is enormous. Nano tier models, GPT-4o Mini, Gemini Flash-Lite, Claude Haiku, run between seven cents and thirty cents per million input tokens. Mid-tier, GPT-4o, Claude Sonnet, Gemini Flash, between fifty cents and three dollars. Frontier models, Claude Opus, GPT-5, Gemini Ultra, three to fifteen dollars. And reasoning models, o3, DeepSeek R1, Claude with extended thinking, six to sixty dollars per million. Dynamic routing across those tiers can reduce inference costs by forty to eighty-five percent while maintaining ninety to ninety-five percent of the quality you'd get from always using the most capable model.
What are the actual routing strategies? Because there's a meaningful difference between "use a cheap model sometimes" and having a real routing system.
Four main approaches. Static routing is the simplest, you hardcode a model per task type in a routing table. Intent classification goes to Haiku, complex reasoning goes to Opus, response synthesis goes to Sonnet. Zero overhead, completely predictable, but it can't adapt when the complexity of a given task varies. The second strategy is classifier-based routing. LMSYS published RouteLLM at ICLR last year, which uses a BERT-scale model, around a hundred and ten million parameters, to predict which tier will suffice for a given query. It runs in under ten milliseconds, no LLM inference needed for the routing decision itself. On MT Bench they achieved eighty-five percent cost reduction while maintaining ninety-five percent of GPT-4 quality.
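The static version really is just a lookup table. A minimal sketch, with model names that are illustrative stand-ins rather than exact API identifiers:

```python
# Static routing: hardcode one model per task type, as described above.
# Model names here are illustrative placeholders, not exact API model IDs.

ROUTES = {
    "intent_classification": "claude-haiku",    # nano tier
    "tool_selection":        "claude-haiku",
    "complex_reasoning":     "claude-opus",     # frontier tier
    "response_synthesis":    "claude-sonnet",   # mid tier
}

DEFAULT_MODEL = "claude-sonnet"  # unknown task types fall back to mid tier

def route(task_type: str) -> str:
    """Return the model for a task type; unknown types get the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(route("intent_classification"))  # claude-haiku
print(route("complex_reasoning"))      # claude-opus
print(route("something_new"))          # claude-sonnet
```

Zero overhead and completely predictable, which is exactly the tradeoff described: it can't adapt when a nominally simple task turns out to be hard.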
And the classifier transfers across model pairs without retraining?
That's one of the key findings. You train it once and it generalizes. The third strategy is cascading, where you try the cheap model first and escalate only if confidence is low. ETH Zurich proved the theoretical optimality conditions for this approach. The tradeoff is latency, because you're making sequential calls, but you eliminate the need for an accurate upfront complexity classifier. The fourth is semantic routing, embedding-based similarity that maps queries to predefined route categories. vLLM shipped a semantic router in September using a ModernBERT-based classifier that achieved about ten percent accuracy improvement and fifty percent latency reduction over prior approaches.
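The cascade strategy can be sketched in a few lines. Everything here is illustrative: `call_model` is a stub standing in for a provider SDK call, and a real system would derive confidence from log-probs or a self-reported score rather than a hardcoded dictionary:

```python
# Cascading: try the cheap tier first, escalate only when confidence is low.
# call_model is a stand-in stub for the actual provider call.

from typing import Callable, Tuple

TIERS = ["nano", "mid", "frontier"]  # cheapest to most capable

def cascade(query: str,
            call_model: Callable[[str, str], Tuple[str, float]],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (tier_used, answer); escalate while confidence < threshold."""
    for tier in TIERS:
        answer, confidence = call_model(tier, query)
        if confidence >= threshold or tier == TIERS[-1]:
            return tier, answer
    raise RuntimeError("unreachable")

# Deterministic stub: pretend only the mid tier is confident on this query.
def fake_model(tier: str, query: str) -> Tuple[str, float]:
    conf = {"nano": 0.55, "mid": 0.91, "frontier": 0.97}[tier]
    return f"{tier} answer", conf

tier, answer = cascade("why is the invoice wrong?", fake_model)
print(tier)  # mid — the nano attempt fell below the 0.8 threshold
```

The latency cost is visible in the structure: a low-confidence first attempt means two sequential calls before the user sees anything.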
And there are commercial products doing this now.
Several. OpenRouter has an auto router powered by Not Diamond's meta-model that selects among dozens of models automatically. Martian was the first commercial LLM router, backed by nine million from NEA and General Catalyst, claims up to ninety-eight percent cost reduction, used by over three hundred companies. Amazon Bedrock shipped Intelligent Prompt Routing to general availability in April last year, routes within model families, Claude three point five Sonnet versus Claude three Haiku, claims sixty percent cost reduction with no additional API cost. And Not Diamond does something interesting on top of routing, it also does prompt adaptation, automatically rewriting prompts to better suit the selected model, which they say produces five to sixty percent accuracy improvements on top of the cost savings.
Alright, let's talk about prompt caching, because this one is particularly interesting to me. The mechanism is different depending on the provider.
Completely different, and the implementation details matter a lot. The core idea is that the provider stores the key-value tensors computed for your prompt prefix. When multiple requests share the same prefix, the cache is hit and you skip recomputation, which reduces both cost and latency. But the critical rule is that you have to put static content first. Any change to early tokens invalidates the cache for everything that follows. If you put a timestamp or a per-request ID at the top of your prompt, you've defeated caching entirely.
Which sounds obvious but I'd bet a lot of teams are doing exactly that.
More than you'd think. The Anthropic implementation requires explicit cache control markers. You add cache control ephemeral to the system prompt block, the minimum eligible size is a thousand and twenty-four tokens, cache creation costs twenty-five percent more than standard input tokens, but cache reads cost only ten percent of standard input tokens. Cache TTL is five minutes, refreshed on each hit. And you can track it in the usage response via cache creation input tokens and cache read input tokens.
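The request shape looks roughly like this. This sketch only builds the JSON payload, no network call; the field names follow the Messages API as described above, but verify them against Anthropic's current documentation before relying on them:

```python
# Sketch of an Anthropic Messages API payload with explicit prompt caching.
# Field names per the discussion above; check current docs before use.

def build_cached_request(model: str, system_prompt: str, user_text: str) -> dict:
    return {
        "model": model,
        "max_tokens": 1024,
        # Static content first: the system block carries the cache marker.
        # The cached prefix must be at least 1,024 tokens to be eligible;
        # the placeholder string below is far too short in reality.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable, per-request content goes after the cached prefix.
        "messages": [{"role": "user", "content": user_text}],
    }

req = build_cached_request("claude-sonnet-4", "...long static instructions...",
                           "Summarize this ticket")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

On the response side, you'd check the usage object for cache creation input tokens and cache read input tokens to confirm the cache is actually being hit.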
OpenAI is different.
OpenAI is automatic. No code changes required. It caches the longest common prefix of prompts longer than a thousand and twenty-four tokens. You just check prompt tokens details dot cached tokens in the usage field to see what you're saving. The implication is that for OpenAI, your job is purely structural, make sure your system prompt is stable and at the top, put variable user content at the bottom.
And for self-hosted infrastructure?
If you're running vLLM, you want to route requests that share a prefix to the same GPU worker. The cache lives on that worker, so if you load balance across workers randomly you'll miss cache hits even when the prefix is identical. Thomson Reuters documented a sixty percent cost reduction on legal document summarization just from caching the boilerplate legal context that got prepended to every query. Production teams on read-heavy workloads commonly report sixty to eighty percent reductions in input token computation from prefix-aware routing.
Let's move to token budgets, because I think this is underappreciated as an optimization. People think about it as a correctness concern, not a cost concern.
Both, and they're connected through something called context rot. LLM performance degrades before you hit the context limit. The lost-in-the-middle effect is well-documented, attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn twelve of a thirty-turn conversation may effectively disappear. The model doesn't error out, it just quietly ignores them. This starts happening at sixty to seventy percent capacity utilization even on models advertising a million-plus token context window.
So the model is silently degrading before you even hit a hard limit.
And you're paying for the tokens that are being ignored. Which is why you need budget tiers that allocate by priority rather than chronology. Protected content, the system prompt and current query, always included. High priority, current tool results and the latest retrieved documents, about thirty percent of the remaining window. Medium priority, the last five turns of conversation history, about twenty-five percent. Low priority, older history, gets the remainder. When you approach the limit, you compress from the bottom up and never touch protected content.
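The tier split can be sketched directly from those percentages. The window size and protected-token count here are illustrative:

```python
# Priority-tier budget allocation as described above: protected content
# always fits, then 30% / 25% / remainder of what's left, by priority.

def allocate_budget(window: int, protected_tokens: int) -> dict:
    """Split the context window by priority rather than chronology."""
    remaining = window - protected_tokens      # after system prompt + query
    high = int(remaining * 0.30)               # tool results, fresh retrieval
    medium = int(remaining * 0.25)             # last five turns
    return {
        "protected": protected_tokens,         # never compressed
        "high": high,
        "medium": medium,
        "low": remaining - high - medium,      # older history gets the rest
    }

budget = allocate_budget(window=100_000, protected_tokens=4_000)
print(budget)
# Compression order when approaching the limit: low first, then medium,
# then high; the protected slice is never touched.
```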
The summarization pattern is interesting here. Walk me through that.
After every eight to twelve turns, you trigger a background summarization pass. You replace raw history older than N turns with a structured summary injected as a system message. A structured summary storing user context, decisions made, and current task state can replace two thousand tokens of dialogue with eighty tokens. A twenty twenty-five analysis found structured summaries over eight to twelve turns reduce per-turn token usage by forty to sixty percent with negligible accuracy loss. The tradeoff is one additional LLM call per summarization cycle, which is usually worth it, but you want to run that summarization with a cheap model.
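A minimal sketch of that compaction step. The `summarize` callable is a stub for the cheap-model call; the message shapes are illustrative:

```python
# Background summarization pass: replace history older than keep_last
# turns with one structured summary message. summarize() stands in for
# a call to a cheap (nano-tier) model.

def compact_history(messages: list, keep_last: int, summarize) -> list:
    """Keep the most recent turns, fold older ones into a summary message."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # run this with a cheap model
    header = {"role": "system", "content": f"Conversation summary: {summary}"}
    return [header] + recent

# Deterministic stub in place of the cheap-model call.
def fake_summarize(msgs: list) -> str:
    return f"{len(msgs)} earlier turns: user is debugging a billing export"

history = [{"role": "user", "content": f"turn {i}"} for i in range(12)]
compacted = compact_history(history, keep_last=5, summarize=fake_summarize)
print(len(compacted))  # 6: one summary message plus the last five turns
```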
There's also the question of using the right tokenizer, which people get wrong.
Badly wrong. Using tiktoken to count Claude tokens can produce estimates off by thirty to fifty percent. Each model family has its own tokenizer. GPT-4o uses tiktoken with the o200k base encoding. Anthropic has a free count tokens endpoint that doesn't consume rate limits. Gemini uses a SentencePiece-based tokenizer accessible through the SDK. Llama requires the exact HuggingFace tokenizer from the repository. And each message in OpenAI's Chat Completions API adds about four tokens for the ChatML wrapper overhead. These aren't rounding errors at scale.
The fourth optimization is LLM response caching, which is different from prompt caching. This is memoizing the responses themselves.
Three-layer architecture. Layer one is exact match caching. You SHA-256 hash the prompt plus model plus temperature, use that as a Redis key, TTL of twenty-four hours. Sub-millisecond lookup. Best for FAQ systems, documentation search, chatbots with predefined flows. Layer two is semantic caching. You generate an embedding for the incoming prompt, do a vector similarity search against cached embeddings, and return a cached response if similarity exceeds your threshold. The reference implementation uses Redis Vector Search with FLAT algorithm and cosine distance. Layer three is the provider-level prompt caching we just discussed.
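Layer one is simple enough to sketch end to end. Here a plain dictionary with expiry timestamps stands in for Redis, so the twenty-four-hour TTL is only simulated:

```python
# Exact-match response caching (layer one). The key is a SHA-256 hash of
# prompt + model + temperature; a dict stands in for Redis here.

import hashlib
import json
import time

_cache: dict = {}
TTL_SECONDS = 24 * 3600

def cache_key(prompt: str, model: str, temperature: float) -> str:
    payload = json.dumps({"p": prompt, "m": model, "t": temperature},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get(prompt: str, model: str, temperature: float):
    entry = _cache.get(cache_key(prompt, model, temperature))
    if entry and time.time() - entry["at"] < TTL_SECONDS:
        return entry["response"]
    return None  # miss or expired: fall through to layer two / the LLM

def put(prompt: str, model: str, temperature: float, response: str) -> None:
    _cache[cache_key(prompt, model, temperature)] = {
        "response": response, "at": time.time(),
    }

put("What are your hours?", "gpt-4o-mini", 0.0, "9am-5pm Mon-Fri")
print(get("What are your hours?", "gpt-4o-mini", 0.0))  # hit
print(get("What are your hours?", "gpt-4o-mini", 0.7))  # None: temp differs
```

Note that temperature is part of the key on purpose: the same prompt at a different temperature is a different distribution of responses, so it shouldn't share a cache entry.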
What's the right similarity threshold for semantic caching?
Above ninety-five you're very strict, only nearly identical queries match. Ninety to ninety-five is balanced, good for most use cases. Eighty-five to ninety is lenient, you get more hits but risk returning an irrelevant cached response. Below eighty-five is too loose for production. And a twenty twenty-five routing paper made an important point that semantic similarity alone is fundamentally inadequate for production caching, you need to combine it with recency and context match signals. The combined impact of all three layers is meaningful: exact match alone gives you fifteen to thirty percent hit rates, semantic alone gets you twenty-five to forty-five percent, combined you're at forty to sixty percent, and adding provider-level prompt caching on top gets you to fifty to seventy percent hit rates with forty-five to seventy percent cost reduction.
Alright, let's shift to part two. Monitoring and cost tracking. You can't optimize what you can't measure, and a lot of teams are flying completely blind here.
The production standard is OpenTelemetry. You set up both metrics for aggregated views and traces for per-request detail, pointing at the same backend. The key metrics to instrument are token usage histograms for input, output, and total tokens, a cost histogram in USD, a latency histogram in milliseconds, a request counter, and an error counter. The instrumented call pattern wraps every LLM call in a trace span, measures latency with a high-resolution timer, extracts token counts from the usage field, calculates cost from a pricing table you maintain, and records everything to both the metrics system and the span attributes.
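The instrumented-call pattern, sketched with stdlib stand-ins: in production the record would go to OpenTelemetry histograms and span attributes rather than a list, and the pricing table is illustrative, something you maintain and keep current yourself:

```python
# Sketch of the instrumented LLM call: time it, pull token counts from
# the usage field, price it from a table you maintain, record everything.
# RECORDS stands in for OTel metrics + span attributes.

import time

# USD per million tokens -- illustrative figures; keep your own table current.
PRICING = {"claude-sonnet-4": {"input": 3.00, "output": 15.00}}

RECORDS: list = []

def instrumented_call(model: str, feature: str, call_fn) -> dict:
    start = time.perf_counter()
    response = call_fn()                     # the actual provider call
    latency_ms = (time.perf_counter() - start) * 1000
    usage = response["usage"]
    price = PRICING[model]
    cost = (usage["input_tokens"] * price["input"]
            + usage["output_tokens"] * price["output"]) / 1_000_000
    record = {
        "model": model, "feature": feature,   # feature label for attribution
        "latency_ms": latency_ms,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cost_usd": cost,
    }
    RECORDS.append(record)
    return record

# Stubbed provider response for illustration.
rec = instrumented_call("claude-sonnet-4", "summarization",
                        lambda: {"usage": {"input_tokens": 12_000,
                                           "output_tokens": 500}})
print(f"${rec['cost_usd']:.4f}")  # $0.0435 at the table rates above
```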
The feature tagging piece is crucial for attribution.
Essential. Every metric label needs a feature field, something like document summarization or translation or chat. This lets you produce queries like total daily cost broken down by model and feature, which is how you find out that your translation feature is spending three times what you expected because it's hitting GPT-4o when it should be hitting mini. You also want to measure time to first token separately from total latency for streaming responses, because TTFT is what users perceive as responsiveness and it doesn't correlate with total generation time.
And there's a logging practice that most teams get wrong.
Log the complete prompt, system message plus conversation history plus tool definitions, for every LLM call. Not just the response. Most teams log responses but not prompts, which makes debugging nearly impossible. When something goes wrong at turn seven of a twelve-turn conversation, you need to see exactly what the model was given, not just what it produced. Yes, it's verbose. It will save hours when you're trying to understand why an agent went sideways.
Let's go through the monitoring tools. There are several in the market and they're not interchangeable.
Quick verdict on each. LangSmith is the right choice if you're on LangChain, because the integration is automatic, zero configuration, it traces chains, agents, tools, and retrievers without any code changes. For non-LangChain code you use the traceable decorator. It has strong built-in evaluators for hallucination and relevance, and annotation queues for human review. Pricing is free up to five thousand traces, then thirty-nine dollars a month for fifty thousand. Overhead is fifteen to thirty milliseconds at the fiftieth percentile. Top rating for evaluation workflows.
Helicone is different in approach.
Proxy approach, which is the key distinction. You change your base URL and add one header. No SDK, no code changes, works with any provider. It gives you real-time cost dashboards broken down by user, model, and feature. Built-in semantic caching with configurable similarity thresholds. Rate limiting per user or API key. Free up to a hundred thousand requests, then twenty dollars a month for ten million. Overhead is ten to twenty milliseconds at the fiftieth percentile. Top rating for cost tracking specifically. And the recommended production combination is Helicone for operational metrics plus LangSmith or Braintrust for evaluation. They're complementary, not competing.
What's Braintrust's angle?
Evaluation-first design. Experiments are first-class citizens. Version-controlled datasets, AI-powered evaluators using LLM judges for factuality and relevance, prompt playground with real-time evaluation against datasets. If your primary concern is whether your agent is producing good outputs rather than how much it costs, Braintrust is the right tool. Fifty dollars per seat per month on the pro plan. Galileo is worth mentioning for agent reliability specifically. They use their Luna-2 small language models for evaluation at ninety-seven percent lower cost than LLM-based evaluation, and they have an Insights Engine that automatically clusters similar failures and surfaces root-cause patterns, which is genuinely useful when you're trying to understand why a class of queries is failing.
And OpenMeter is for a different use case entirely.
Customer-facing usage dashboards and usage-based billing. If you're building an AI product and need to bill your customers by token consumption, OpenMeter handles the metering to Stripe billing pipeline. It's open source, about nineteen hundred GitHub stars, now part of Kong. The use case is: you're an AI company, your customers each have a token budget, you need to enforce limits and generate invoices. That's a different problem from internal cost tracking.
Let's get into alert patterns, because this is where the forty-seven thousand dollar incident becomes preventable.
Seven patterns, and I'd implement them in a specific priority order. Day one, budget governors and a kill switch. Week one, observability and output guardrails. Week two, circuit breakers and retry classification. Week three to four, human in the loop for high-stakes actions. The budget governors are the most important because they prevent financial damage immediately. You need limits at three levels: per request, per session, and per day. A good production configuration has a twenty thousand token limit per session, fifteen tool calls maximum per session, fifty cents maximum cost per session, and a two-minute wall clock timeout. In development you can be more generous, but tighten these for production.
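The session governor with exactly those limits can be sketched as a small class that's consulted after every recorded spend. The class and error names are illustrative:

```python
# Per-session budget governor with the production limits quoted above:
# 20k tokens, 15 tool calls, $0.50, two-minute wall clock.

import time
from dataclasses import dataclass, field

class BudgetExceededError(Exception):
    pass

@dataclass
class SessionBudget:
    max_tokens: int = 20_000
    max_tool_calls: int = 15
    max_cost_usd: float = 0.50
    max_wall_clock_s: float = 120.0
    tokens: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        if self.tokens > self.max_tokens:
            raise BudgetExceededError("token limit")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceededError("tool-call limit")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceededError("cost limit")
        if time.monotonic() - self.started_at > self.max_wall_clock_s:
            raise BudgetExceededError("wall-clock limit")

    def record(self, tokens: int = 0, tool_calls: int = 0,
               cost_usd: float = 0.0) -> None:
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.cost_usd += cost_usd
        self.check()  # raise before the next LLM call, not after the bill

budget = SessionBudget()
budget.record(tokens=8_000, cost_usd=0.12)   # fine
try:
    budget.record(cost_usd=0.45)             # pushes past the $0.50 cap
except BudgetExceededError as e:
    print("halted:", e)
```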
The kill switch is the one I find most interesting architecturally.
It needs to be checked before every LLM call and before every tool execution. Not just once at the start of a session. The implementation fetches a remote config with a five-second TTL cache, checks a global kill switch flag, and checks whether the specific agent ID is in a disabled list. If either check fails, it throws an AgentHaltedError immediately. The reason you check before tool execution specifically is that an agent can do real-world damage through tool calls, writing files, making API calls, sending messages, even after you've stopped the LLM from generating more tokens.
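A sketch of that check, with `fetch_remote_config` as a stub standing in for whatever config service or feature-flag system you use. The error name matches the one in the discussion:

```python
# Kill switch: remote config with a five-second TTL cache, consulted
# before every LLM call AND before every tool execution.

import time

class AgentHaltedError(Exception):
    pass

def fetch_remote_config() -> dict:
    # Stand-in; real code would hit your config service or feature flags.
    return {"kill_switch": False, "disabled_agents": ["billing-agent-v1"]}

_cached: dict = {"config": None, "at": 0.0}
TTL = 5.0

def check_kill_switch(agent_id: str) -> None:
    now = time.monotonic()
    if _cached["config"] is None or now - _cached["at"] > TTL:
        _cached["config"] = fetch_remote_config()
        _cached["at"] = now
    cfg = _cached["config"]
    if cfg["kill_switch"]:
        raise AgentHaltedError("global kill switch is on")
    if agent_id in cfg["disabled_agents"]:
        raise AgentHaltedError(f"{agent_id} is disabled")

check_kill_switch("support-agent")            # passes
try:
    check_kill_switch("billing-agent-v1")     # in the disabled list
except AgentHaltedError as e:
    print("halted:", e)
```

The five-second cache is the compromise: you don't add a config fetch to every call's latency, but a flipped switch still takes effect within seconds.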
Circuit breakers are well-understood in traditional distributed systems but people don't always apply them to agent tool calls.
The pattern is identical to the microservices version. You track failure counts per tool, and after five failures you open the circuit, which means any call to that tool immediately throws a CircuitOpenError for a sixty-second cooldown period. The important detail is what you do with that error: you feed it back to the agent as context. Something like "The service is temporarily unavailable, try an alternative approach." This lets the agent adapt rather than just failing.
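A sketch with the thresholds from the discussion, five failures and a sixty-second cooldown. Tool names and the flaky stub are illustrative:

```python
# Per-tool circuit breaker: open after 5 failures, 60-second cooldown,
# and surface the open state as context the agent can route around.

import time
from collections import defaultdict

class CircuitOpenError(Exception):
    pass

FAILURE_THRESHOLD = 5
COOLDOWN_S = 60.0

_failures: dict = defaultdict(int)
_opened_at: dict = {}

def call_tool(tool_name: str, fn):
    opened = _opened_at.get(tool_name)
    if opened is not None:
        if time.monotonic() - opened < COOLDOWN_S:
            raise CircuitOpenError(
                f"{tool_name} is temporarily unavailable, try an alternative")
        _opened_at.pop(tool_name)      # cooldown over: half-open retry
        _failures[tool_name] = 0
    try:
        result = fn()
        _failures[tool_name] = 0       # any success resets the count
        return result
    except Exception:
        _failures[tool_name] += 1
        if _failures[tool_name] >= FAILURE_THRESHOLD:
            _opened_at[tool_name] = time.monotonic()
        raise

def flaky():
    raise TimeoutError("upstream timeout")

for _ in range(5):                     # five real failures open the circuit
    try:
        call_tool("crm_lookup", flaky)
    except TimeoutError:
        pass
try:
    call_tool("crm_lookup", flaky)     # sixth call: circuit is open
except CircuitOpenError as e:
    print(e)  # feed this message back to the agent as context
```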
The retry-classify pattern is one I hadn't seen articulated this way before.
Blind retry is the enemy. When an error occurs, you classify it first. Rate limit errors get exponential backoff. Validation errors get a repair strategy, where you feed the specific error message back to the LLM and ask it to self-correct. Authentication errors fail fast immediately, because retrying won't help. Timeout errors get backoff. Unknown errors fail fast, because you don't want to waste tokens on something you don't understand. Teams using the repair strategy on schema-level validation failures report success rates well above fifty percent. That's a lot of calls that would otherwise fail completely.
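The classification step is a small dispatch function. The exception classes here are stand-ins for whatever your provider SDK actually raises:

```python
# Retry-classify: map error types to strategies instead of retrying blindly.
# The custom exception classes are illustrative stand-ins for SDK errors.

class RateLimitError(Exception): pass
class ValidationError(Exception): pass
class AuthError(Exception): pass

def classify(error: Exception) -> str:
    if isinstance(error, (RateLimitError, TimeoutError)):
        return "backoff"     # exponential backoff, then retry
    if isinstance(error, ValidationError):
        return "repair"      # feed the error text back, ask the model to fix it
    if isinstance(error, AuthError):
        return "fail_fast"   # retrying will not help
    return "fail_fast"       # unknown: don't burn tokens on a mystery

print(classify(RateLimitError()))                       # backoff
print(classify(ValidationError("missing: amount")))     # repair
print(classify(AuthError()))                            # fail_fast
print(classify(KeyError("oops")))                       # fail_fast
```

The repair branch is where the payoff is: on a schema validation failure, the follow-up prompt includes the exact validator error so the model can self-correct instead of regenerating blind.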
Stuck detection is underrated as a pattern.
Straightforward to implement. You look at the last five tool calls in the history. If more than eighty percent of them are the same tool, the agent is stuck. At that point you inject a meta-prompt: "You appear to be repeating the same action without progress. Stop and reconsider your approach." This is much cheaper than letting the agent loop until it hits a budget ceiling.
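The heuristic, exactly as described, fits in a few lines:

```python
# Stuck detection: if more than 80% of the last five tool calls are the
# same tool, inject a meta-prompt instead of letting the loop burn budget.

from collections import Counter

META_PROMPT = ("You appear to be repeating the same action without "
               "progress. Stop and reconsider your approach.")

def is_stuck(tool_history: list, window: int = 5,
             threshold: float = 0.8) -> bool:
    recent = tool_history[-window:]
    if len(recent) < window:
        return False             # not enough signal yet
    _, count = Counter(recent).most_common(1)[0]
    return count / window > threshold

print(is_stuck(["search"] * 5))                              # True
print(is_stuck(["search", "fetch", "search", "fetch", "summarize"]))  # False
# When is_stuck() fires, append META_PROMPT to the conversation.
```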
Portkey has a structured alert escalation that's worth mentioning.
Four-level system. At seventy percent of budget, notify the team lead. At one hundred percent, notify finance and engineering. At a hundred and twenty percent, trigger escalation or lockout. And the enforcement options aren't just alerts, you can block further requests, throttle usage, or route to a cheaper model automatically when limits are approached. AgentCap's approach is similar but uses a rolling average: if spend is three times the rolling hourly average, trigger an instant alert regardless of absolute budget position.
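Both checks reduce to a few comparisons. Thresholds match the discussion; the action names are illustrative labels for whatever your enforcement hooks do:

```python
# Four-level budget escalation plus the rolling-average spike check.

def escalation_level(spent: float, budget: float) -> str:
    ratio = spent / budget
    if ratio >= 1.20:
        return "lockout"             # block, throttle, or route to cheaper model
    if ratio >= 1.00:
        return "notify_finance_eng"
    if ratio >= 0.70:
        return "notify_team_lead"
    return "ok"

def rolling_spike(hourly_spend: float, rolling_avg: float) -> bool:
    """AgentCap-style check: alert if spend hits 3x the rolling hourly average."""
    return hourly_spend >= 3 * rolling_avg

print(escalation_level(70.0, 100.0))    # notify_team_lead
print(escalation_level(125.0, 100.0))   # lockout
print(rolling_spike(9.0, 2.5))          # True: 9 >= 7.5
```

The point of the rolling-average check is that it fires even when you're well inside the absolute budget, which is exactly the shape of a runaway loop in its first hour.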
Let's close out with cost attribution in multi-agent systems, because this is where things get complicated fast.
The core problem is that costs multiply non-linearly in multi-agent systems. Each agent receives context from upstream agents, adds its own reasoning, and passes a longer message downstream. Production analysis has found coordination chatter alone increasing token counts by four times compared to single-agent approaches. So attribution isn't just about knowing what you spent, it's about understanding which agent in the chain is the expensive one.
The metadata tagging approach is the foundation.
Every LLM call needs six tags minimum: model, agent ID, task ID, feature, user ID, and session ID. This lets you build a hierarchical cost rollup from per-step cost all the way up to per-feature cost. The trace ID propagation piece is what ties it together across agents. You assign a trace ID at the orchestrator level and propagate it through all sub-agent calls. This lets you reconstruct the full cost of a single user request across every agent that touched it, which is how you find out that your research agent is spending three times what the summarization agent spends on the same task.
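The rollup itself is trivial once the tags are on every record. The call records below are illustrative, with costs in integer cents to keep the arithmetic exact; in production they'd come out of your tracing backend:

```python
# Hierarchical cost rollup over tagged call records: group by any of the
# six tags. Costs are in integer cents; records are illustrative.

from collections import defaultdict

calls = [
    {"model": "claude-opus", "agent_id": "research", "task_id": "t1",
     "feature": "report", "user_id": "u1", "session_id": "s1", "cost_cents": 90},
    {"model": "claude-haiku", "agent_id": "summarizer", "task_id": "t1",
     "feature": "report", "user_id": "u1", "session_id": "s1", "cost_cents": 30},
    {"model": "claude-opus", "agent_id": "research", "task_id": "t2",
     "feature": "report", "user_id": "u2", "session_id": "s2", "cost_cents": 90},
]

def rollup(records: list, key: str) -> dict:
    """Sum cost grouped by any tag: agent_id, feature, user_id, and so on."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r["cost_cents"]
    return dict(totals)

print(rollup(calls, "agent_id"))  # {'research': 180, 'summarizer': 30}
print(rollup(calls, "feature"))   # {'report': 210}
```

This is the query that tells you the research agent is spending six times what the summarizer does on the same feature.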
And the per-agent budget limits are separate from session limits.
Separate, and this is where AgentCap's approach is elegant. One header in your config per agent, every agent tracked independently. You can have a research agent with a two-dollar budget and a summarization agent with a twenty-cent budget, and you know immediately which one went over. Without per-agent limits, you get a session-level bill and have to dig through logs to figure out who caused it. Langfuse supports routing-aware tracing that captures the selected model on every span, which is useful when you're also doing model routing, because you can see the interaction between routing decisions and cost attribution.
So if I'm putting this together as a practical guide, what does the implementation roadmap look like?
Day one, you implement budget governors and a kill switch. These are the emergency brake and the seatbelt. You cannot go to production without them. Week one, you add observability, the OpenTelemetry instrumentation with full prompt logging, and you add output guardrails. Week two, circuit breakers and retry classification. Week two to three, model routing, starting with static routing because it's the easiest and you can move to classifier-based later. Prompt caching can be added at the same time, it's usually a one-day implementation change. Week three to four, semantic response caching and human-in-the-loop for any high-stakes tool executions. And throughout all of this, you're reviewing your cost dashboards weekly and tightening budget limits as you understand the actual distribution of what normal looks like.
The thing I keep coming back to is that most of these optimizations are not just cost optimizations, they're also reliability optimizations. Stuck detection, circuit breakers, token budgets, they all make your agent more robust, not just cheaper.
That's the reframe I'd encourage anyone building production agents to internalize. The monitoring and cost infrastructure isn't overhead, it's what makes the system trustworthy enough to actually run in production. A forty-seven thousand dollar incident isn't just a billing problem, it means your system ran for forty-eight hours doing something completely wrong and nobody noticed. The observability layer is how you ensure that can't happen.
Good place to land. Big thanks to our producer Hilbert Flumingtop for putting this one together, and to Modal for providing the GPU credits that power this show. This has been My Weird Prompts. If you haven't followed us on Spotify yet, search for My Weird Prompts and hit follow so you don't miss new episodes.
See you next time.