So Daniel sent us this one, and it's a topic I know a lot of people in the AI space are quietly struggling with. Here's what he wrote: the gap between vibe coding a demo agent and actually shipping one to production is enormous, and almost entirely invisible to tutorials. He wants us to cover the real production concerns — full trace logging and observability, versioning agent behavior when prompts and models change, A/B testing agent configurations, rollback strategies, handling non-determinism in testing, rate limiting at scale, and the human oversight question. When does a human need to step in, and how do you even build that? So basically, everything the notebook demo skips.
Herman Poppleberry, and yeah, this one is close to my heart. There's a quote from Logic.inc that I keep coming back to: getting an LLM agent to work in a demo takes a day, getting it to work reliably in production takes weeks, and the gap isn't the model, it's everything around it. That framing is exactly right.
And the numbers back that up in a pretty sobering way. McKinsey's twenty twenty-five survey found sixty-two percent of organizations are experimenting with or deploying AI agents, but only twenty-three percent claim to be in the scaling phase. And Gartner is predicting that forty percent of agentic AI projects will be canceled by twenty twenty-seven.
Forty percent. That's not a rounding error. That's a structural problem. And the reason Gartner cites is not that the models aren't good enough — it's reliability challenges, unclear business value, and inadequate risk controls. Which is almost exactly the list Daniel sent us.
So let's go through it properly. By the way, today's episode is powered by Claude Sonnet four point six, which feels appropriate given we're talking about what happens when you try to actually run one of these things in the real world. Let's start with observability, because I think this is the one that bites people first.
It does, and the reason is that your intuition about debugging completely breaks down. With a normal web service, you look at HTTP status codes, latency, error messages — the signal is in the infrastructure. With an LLM agent, the infrastructure can be perfectly healthy while the agent is producing completely wrong outputs. A two hundred response with a well-formed JSON payload can still contain garbage.
So what does useful observability actually look like here?
The core concept is a trace. A trace is a structured timeline of everything the agent did — every LLM call, every tool invocation, every intermediate prompt, every raw response before post-processing. LangSmith, which is LangChain's observability product, records all of this and you can enable it with a single environment variable: LANGSMITH_TRACING equals true. That's the easy part.
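To make the idea concrete, here's a minimal Python sketch of what a trace captures. This is an illustrative data model, not LangSmith's actual schema; the point is that every step the agent takes becomes a structured span you can inspect later.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    kind: str      # "llm_call", "tool_call", "retrieval", "orchestration"
    name: str
    input: str
    output: str
    started: float = field(default_factory=time.time)

@dataclass
class Trace:
    session_id: str
    spans: list = field(default_factory=list)

    def record(self, kind: str, name: str, input: str, output: str) -> Span:
        span = Span(kind, name, input, output)
        self.spans.append(span)
        return span

trace = Trace(session_id="sess-123")
trace.record("llm_call", "plan", "user question", "decided to search")
trace.record("tool_call", "web_search", "query", "[] (empty results)")
trace.record("llm_call", "synthesize", "context: []", "confident wrong answer")

# With the trace, the failure is attributable: the tool returned nothing,
# so this is a tool failure, not a prompt failure.
tool_spans = [s for s in trace.spans if s.kind == "tool_call"]
assert "empty" in tool_spans[0].output
```

Without the span-level record, all three steps collapse into one bad final answer with no way to tell which of the four failure modes produced it.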
What's the hard part?
Interpreting it. Because the trace reveals four distinct failure modes that look identical from the outside. Prompt failures, where the tool outputs are correct but the final LLM synthesis is wrong. Tool failures, where a search API times out or returns empty results. Retrieval failures in RAG systems, where the vector search fetches irrelevant documents. And orchestration failures, where the agent loops, takes wrong branches, or burns through fifteen steps when three would do.
And without the trace, you're just staring at the wrong answer with no idea which of those four things happened.
"A game of guessing" is how one of the engineers at DigitalOcean described it. The other tools worth knowing: Arize Phoenix is strong specifically on RAG observability and embedding drift detection. Braintrust is used in production by Notion, Stripe, Vercel, Airtable, Zapier — it's become a kind of category-defining tool for LLM evaluation. LangWatch is OpenTelemetry-native, which is interesting because it means LLM observability is converging with traditional distributed systems monitoring.
OpenTelemetry convergence is actually a big deal. That's the same standard your Kubernetes cluster uses. So theoretically you could route LLM traces through the same infrastructure you're already running.
Which is exactly what teams with mature setups are doing. And one practical note on cost: you don't need to trace every single request. LangSmith supports sampling — trace five percent of traffic, or trace specific sessions, or trace on error. Once you're at scale, tracing everything is genuinely expensive.
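The sampling decision itself is simple to sketch. Hashing the session id deterministically means a sampled session keeps all of its requests together; the five percent rate and the always-trace-on-error rule mirror what was just described. This is an illustrative policy, not any particular vendor's API.

```python
import hashlib

def should_trace(session_id: str, had_error: bool, rate: float = 0.05) -> bool:
    if had_error:
        return True  # always keep traces for failures
    # Deterministic hash -> uniform value in [0, 1): the whole session
    # is either in or out, so sampled traces are complete.
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

assert should_trace("any-session", had_error=True)
sampled = sum(should_trace(f"sess-{i}", had_error=False) for i in range(10_000))
assert 300 < sampled < 700  # roughly 5% of 10,000 sessions
```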
Alright, so you've got visibility into what your agent is doing. Now you need to be able to change it without blowing everything up. Which brings us to versioning.
This is where things get philosophically interesting, because prompt versioning is categorically different from code versioning, and most teams don't realize that until they've had an incident. The canonical story from TianPan.co is perfect here. A product manager adds three words to a customer service prompt to make it "more conversational." Within hours, structured-output error rates spike and a revenue-generating pipeline stalls. Engineers spend most of a day debugging infrastructure and code. Nobody thinks to look at the prompt until much later. There's no version history. There's no rollback. The three-word change was made inline, in a config file, by someone who had no reason to think it was risky.
Three words. That's the kind of thing that gets mentioned in a Slack message and then disappears into the void.
And that's exactly what happened. The organizational reality is that prompts sit at the intersection of product intent, legal interpretation, and technical execution. No single existing role owns them naturally. The result is informal, shared non-ownership that fails catastrophically during incidents. Postmortem finding: engineers can't identify who made the last prompt change or why, because it happened in a DM, was applied directly in a GUI, and was never documented anywhere.
So what's the principled solution?
The immutability principle. Once a prompt version is published to production, it must never be modified. Any change — even a typo fix — creates a new version. This sounds obvious but it conflicts deeply with how most teams think about prompts. Prompts feel like configuration. Light. Reversible. They're not.
And then you need semantic versioning for prompts specifically.
Which maps cleanly once you think about it. A major version bump is a breaking change — structural rewrites, persona changes, output format changes that break downstream parsers, or a model switch. Minor bumps are new capabilities without altering existing behavior. Patches are typo fixes and minor wording improvements. But here's the catch: if a patch causes a measurable behavior change in your evaluation suite, it should have been a minor or major bump. The version number is a contract.
What goes into a version? Just the prompt text?
That's where most teams underscope it. The execution context is a single coherent unit: prompt template plus model name and specific version plus temperature and sampling parameters plus retrieval configuration if you're doing RAG, plus the author and the rationale for the change. Changing the model from claude-opus to claude-sonnet is a potentially breaking change regardless of whether the prompt text changed at all.
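A sketch of the "execution context as one unit" idea: everything that affects behavior is versioned together and frozen once published, and a model swap counts as breaking even when the prompt text is identical. Field names and model identifiers here are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: published versions are immutable
class PromptVersion:
    version: str       # semver string, e.g. "2.1.0"
    template: str
    model: str         # pinned model identifier (illustrative names below)
    temperature: float
    author: str
    rationale: str     # why this change was made

def is_breaking(old: PromptVersion, new: PromptVersion) -> bool:
    # A model change is a major (breaking) change regardless of the text.
    return old.model != new.model

v1 = PromptVersion("1.0.0", "Answer concisely: {q}", "model-a", 0.2,
                   "alice", "initial release")
v2 = PromptVersion("2.0.0", "Answer concisely: {q}", "model-b", 0.2,
                   "bob", "cost reduction: switched model")
assert is_breaking(v1, v2)  # same prompt text, still a major bump
```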
And the model version problem is sneakier than people realize, because providers update weights without telling you.
This is the provider drift problem, and it's genuinely alarming. In April twenty twenty-five, a major provider pushed a behavioral update without public announcement. Within forty-eight hours, developers noticed the model was producing outputs that failed safety checks it had previously passed. A February twenty twenty-six longitudinal study confirmed meaningful behavioral drift across deployed transformer services over a ten-week period, with attribution being impossible because providers don't release update logs.
So your agent can regress without you changing a single line of code.
Which is why the defense is a golden dataset — a curated, versioned collection of representative prompts evaluated automatically on a cadence. Block deploys when your overall score drops more than three percent relative to the main branch baseline. And critically, sample one percent of real production traffic into your evaluation queue continuously, because your curated test set will develop blind spots within weeks if you don't refresh it with real inputs.
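The deploy gate described here reduces to a few lines. This is a minimal sketch of the "block on more than a three percent relative drop" rule, assuming your evaluation harness produces a single aggregate score per run.

```python
def deploy_allowed(baseline_score: float, candidate_score: float,
                   max_relative_drop: float = 0.03) -> bool:
    if baseline_score <= 0:
        return False  # no trustworthy baseline: never auto-approve
    relative_drop = (baseline_score - candidate_score) / baseline_score
    return relative_drop <= max_relative_drop

assert deploy_allowed(0.90, 0.89)        # ~1.1% relative drop: ships
assert not deploy_allowed(0.90, 0.85)    # ~5.6% relative drop: blocked
assert deploy_allowed(0.90, 0.93)        # improvements always pass the gate
```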
Okay, so you can see what your agent is doing, you can version its behavior, and you have a test suite. Now you want to swap in a new model or a new prompt and measure whether it's actually better. That's A/B testing, and this is where the non-determinism makes everything harder.
The non-determinism problem is one of those things that surprises even experienced engineers. The intuition is: set temperature to zero, use greedy sampling, and you get reproducible outputs. That's wrong in practice. Even with temperature at zero, LLM APIs are not deterministic. The root cause is GPU floating-point arithmetic — operations aren't strictly associative, and batch size variability during parallel sequence processing introduces different rounding errors at inference time. Research has documented accuracy variations of up to fifteen percent across runs with identical inputs.
Fifteen percent variance with temperature zero. That's not a rounding error, that's a completely different answer sometimes.
It breaks the assumptions that traditional unit testing relies on. And it means your A/B test needs to account for this variance in its statistical design. For a five percent minimum detectable effect with eighty percent power and ninety-five percent confidence, you're looking at tens of thousands of sessions per arm. That's not a weekend experiment.
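The "tens of thousands per arm" figure falls out of the standard two-proportion sample-size formula. The baseline rate and effect size below are illustrative assumptions, not numbers from a real experiment; the approximation is the usual normal one.

```python
import math

def sessions_per_arm(p_baseline: float, relative_mde: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = 1.96   # two-sided 95% confidence
    z_beta = 0.84    # 80% power
    delta = p_baseline * relative_mde    # absolute effect size
    p_bar = p_baseline + delta / 2       # pooled-rate approximation
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

# Detecting a 5% relative change in an assumed 10% regeneration rate:
n = sessions_per_arm(p_baseline=0.10, relative_mde=0.05)
assert n > 10_000  # tens of thousands of sessions per arm
```

Note how sensitive this is to the baseline: rare events (low baseline rates) need dramatically more traffic to detect the same relative change.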
So walk me through how you actually structure the rollout when you're swapping a model or a major prompt change.
Three phases. Phase one is shadow mode — zero user exposure. You duplicate production requests to both the current model, which serves users, and the candidate model, which doesn't. You log both outputs and run an automated evaluation layer, typically an LLM judge, that compares them against your criteria. Without the automated evaluation, shadow mode just gives you a pile of logs. The downside is it roughly doubles your inference spend during evaluation, so it's the right tool for major changes, not minor prompt tweaks.
Phase two.
Canary deployment. Real users, small exposure. Start at one percent of traffic, sometimes as low as zero point one percent for high-stakes applications. Gradually increase: one to five to twenty to fifty to one hundred. The critical infrastructure requirement is consistent user assignment — a user who hits the canary on one request should hit it on subsequent requests in the same session. Randomly assigning each individual request creates an incoherent user experience.
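Consistent assignment is usually done by hashing a stable identifier rather than rolling dice per request. A sketch, with the nice property that ramping the canary percentage up never moves an already-canaried user back to control:

```python
import hashlib

def assign_arm(session_id: str, canary_percent: float) -> str:
    # Deterministic hash of the session id -> stable bucket in [0, 10000),
    # giving 0.01% granularity for very cautious rollouts.
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "control"

# Same session always lands in the same arm:
assert assign_arm("sess-42", 1.0) == assign_arm("sess-42", 1.0)

# Ramping 1% -> 5% only adds users to the canary, never removes them:
for sid in ("sess-1", "sess-2", "sess-3"):
    if assign_arm(sid, 1.0) == "canary":
        assert assign_arm(sid, 5.0) == "canary"
```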
And what are you actually measuring during canary?
Latency percentiles — p fifty, p ninety-five, p ninety-nine, not just averages, because LLM latency distributions are highly skewed. Cost per request, because token counts change with model versions. Error and refusal rates, because a new model might refuse categories of requests the old one handled. Output length distribution, because mode collapse to very short outputs or runaway verbosity both indicate problems. And user feedback signals — regeneration requests, session abandonment, thumbs down.
The refusal rate one is underappreciated. You could swap to a newer, more capable model and suddenly it's refusing ten percent of your edge cases that the old model handled fine.
And you won't discover that from infrastructure metrics. CPU is fine, error rate is fine, latency is fine. The problem only shows up in behavioral metrics. Which is why canary alone isn't enough — that's what phase three is for.
A/B testing.
Canary tells you whether the new version is safe. A/B testing tells you whether it's better. Those are genuinely different questions. For signals, implicit ones are the most reliable in practice — regeneration requests, immediate session abandonment, follow-up clarification questions. These are available in real-time with no rating infrastructure. Explicit signals like thumbs up or down are high quality but low coverage — typically only two to five percent of responses ever get rated. LLM judge evaluation fills the gap at scale.
And you need to pre-stratify before statistical analysis.
This is important. A new model might be better for factual Q and A and worse for creative tasks. Aggregating across both obscures both signals. You need to know which request types the new version wins on and which it loses on, because the answer might be "ship it for half your use cases and not the other half."
Alright, let's talk about what happens when you ship and it's wrong. Rollback.
The litmus test I keep coming back to is simple: if rolling back a prompt change takes more than fifteen minutes, your system isn't production-ready. Mature teams do it in under sixty seconds. The mechanism is changing a version pointer in a prompt registry — it's instant and it does not require a code deploy.
Sixty seconds versus fifteen minutes is a huge gap. What separates them?
Four patterns. Canary rollout, which we've covered — deploy to a small percentage, monitor quality scores, revert based on quality not just infrastructure health. Shadow testing, where users only see the production version's output while you evaluate the candidate offline. Blue-green, where you maintain two complete environments and switch all traffic at once at a known point in time — rollback is switching the pointer back. And feature flags, which decouple deployment from decision entirely.
Feature flags for agents have a specific wrinkle though.
The statefulness problem. A flag that changes which model handles a request is fine for stateless queries. It's a serious problem for multi-turn conversations — changing models mid-conversation causes jarring style shifts and context loss. Flag evaluation for conversational AI needs to lock to the session level, not the request level. A user's conversation has to stay on one version of the agent for its entire duration.
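Session-level locking can be as simple as deciding once and caching. A minimal sketch, assuming an in-memory store stands in for whatever session state you actually have:

```python
# In production this would live in your session store; a dict stands in here.
_session_locks: dict[str, str] = {}

def model_for_session(session_id: str, current_flag_value: str) -> str:
    # The first request in a session locks the version; later requests
    # reuse it even if the flag flips mid-conversation.
    return _session_locks.setdefault(session_id, current_flag_value)

assert model_for_session("s1", "model-a") == "model-a"
# The flag flips, but the in-flight conversation stays put:
assert model_for_session("s1", "model-b") == "model-a"
# New sessions pick up the new flag value:
assert model_for_session("s2", "model-b") == "model-b"
```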
And there's another wrinkle with model version pinning.
Traditional feature flags control code paths that don't change unless you modify them. Model behavior drifts over time even if you don't touch the flag — providers update weights, behavior patterns shift. A flag pointing at gpt-4o-latest today doesn't point at the same behavior in three months. Pin model versions in your flags where possible. gpt-4o-2024-11-20 is a stable flag target. gpt-4o-latest is not.
Logic.inc has a concept they call atomic rollback that I think is worth naming explicitly.
When a deployment causes regressions, you need to revert to a known-good state across all layers simultaneously — the prompts, the tool integrations, and the models. Without that, rollback means redeploying your entire application rather than reverting a single agent change. The atomicity is what makes sixty-second rollbacks possible.
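What makes the pointer swap atomic is that production references one immutable release bundle covering all the layers at once. A sketch of the registry pattern, with illustrative names:

```python
class ReleaseRegistry:
    def __init__(self):
        self._releases = {}      # version -> immutable bundle
        self._production = None  # the pointer the runtime reads

    def publish(self, version: str, bundle: dict):
        self._releases[version] = dict(bundle)  # copy: published means frozen

    def promote(self, version: str):
        self._production = version

    def rollback(self, version: str):
        assert version in self._releases, "only roll back to a known-good release"
        self._production = version  # one pointer swap, no code deploy

    def current(self) -> dict:
        return self._releases[self._production]

reg = ReleaseRegistry()
reg.publish("1.4.0", {"prompt": "v1 text", "model": "model-a", "tools": ["search"]})
reg.publish("1.5.0", {"prompt": "v2 text", "model": "model-b", "tools": ["search", "db"]})
reg.promote("1.5.0")
reg.rollback("1.4.0")  # prompts, model, and tool config all revert together
assert reg.current()["model"] == "model-a"
```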
Let's get into the testing and QA side of non-determinism more deeply, because I think the sycophancy problem in particular is one that people don't see coming.
It's a subtle and genuinely nasty failure mode. When you're testing an agent, you expect that a correct answer is a pass and an incorrect answer is a fail. The sycophancy problem breaks that assumption. An LLM that's become overly agreeable will confirm whatever the user says rather than reasoning through the task. In testing, this creates the illusion of success — the agent tells a customer their account issue is resolved simply because the customer insists it is, even though the agent never confirmed the fix in the system.
So the test passes because the agent agreed with the test.
Which is particularly bad if your test inputs are phrased in a way that implies the right answer. You've essentially trained your evaluation to be sycophant-friendly. The fix requires dynamic testing — using an LLM-based simulated customer that follows a goal under varied conditions, including conditions where the simulated customer is wrong about something, and verifying that the agent doesn't just capitulate.
Cresta breaks down the testing framework into static and dynamic components. Static is historical conversations as fixed inputs for regression testing. Dynamic is the simulated customer approach.
And then there's the pass-to-the-k approach for addressing non-determinism directly. Rather than treating a single test pass as sufficient, you re-run critical tests multiple times and require consistent performance across runs. If your agent passes eight out of ten runs on a critical scenario, that's not good enough for production. You're looking for ten out of ten, or at least something close.
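The gate itself is tiny; the discipline is in applying it. A sketch where `run_test` stands in for invoking the agent on a critical scenario and scoring the result:

```python
def passes_consistently(run_test, k: int = 10, required: int = 10) -> bool:
    # Run the scenario k independent times; demand `required` passes.
    return sum(run_test() for _ in range(k)) >= required

assert passes_consistently(lambda: True)       # 10/10: ships
assert not passes_consistently(lambda: False)  # 0/10: blocked

# An agent that passes 8 of 10 runs fails a strict 10-of-10 gate,
# which is exactly the point:
eight_of_ten = iter([True] * 8 + [False] * 2)
assert not passes_consistently(lambda: next(eight_of_ten))
```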
For evaluation methods, you've got three tiers.
Deterministic evaluation first — logic checks: did the API call succeed, was the data retrieved accurately, were the compliance steps completed. These are binary and cheap. Then expert-aligned LLM judge evaluation for nuanced behaviors — flow adherence, response relevance, factuality, hallucination detection. And then manual human review for edge cases and calibrating the judge itself, because an LLM judge that isn't calibrated against human expert judgment will drift in its own ways.
And scope matters — turn-based versus conversation-based evaluation are answering different questions.
Turn-based ensures an action occurs at a specific moment — reading a disclosure immediately after collecting payment information, for example. Conversation-based evaluates whether goals are met over an entire flow — did the agent complete the full payment sequence from authentication to confirmation while maintaining compliance throughout? Both are necessary. Turn-based catches precise behavioral requirements. Conversation-based catches the emergent failures that only appear when you look at the whole arc.
Alright, let's shift to something that feels more mechanical but has some genuinely surprising complexity: rate limiting and concurrency at scale.
The demo hits one API endpoint sequentially. Production hits it ten thousand times, with wildly variable token counts, from burst patterns you can't predict, through multiple layers of tool use that multiply the call count per user action. And here's the thing most tutorials gloss over: LLM traffic is fundamentally unpredictable in a way that normal API traffic isn't.
Because the output length is probabilistic.
One request might consume two hundred tokens, another might consume four thousand — for similar-looking inputs. That's a twenty-to-one swing in GPU demand. And in multi-step agent workflows, a single user action can trigger retrieval, then an LLM call, then a tool use, then another LLM call to synthesize the results. The token consumption compounds.
There's a concrete example from the OpenAI community forums that illustrates the concurrency problem well.
Sending a hundred and thirty-eight requests in parallel to gpt-4o-mini, roughly a hundred tokens each, fifteen thousand tokens total — return latency was forty seconds. Sending five requests, six hundred and thirty-eight tokens total — one point seven seconds. End-to-end latency scales non-linearly with the number of concurrent requests. That's not a linear cost you can just budget around; it's a capacity constraint that creates unpredictable user-facing latency spikes.

So what's the architectural solution?
Four layers of rate limiting working together. Request-based limits, which protect infrastructure from sudden floods and control gateway load. Token-based limits, which are the primary mechanism for managing GPU capacity — they map directly to compute usage and cost. Cost and usage limits, which enforce budget ceilings over daily or monthly periods and prevent runaway workloads from batch jobs or agent loops. And per-team or per-application limits, which allocate independent quotas across tenants so no single workload can consume shared capacity.
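The token-based layer, the one that maps to GPU demand, is typically a token bucket denominated in LLM tokens per minute rather than requests. A minimal sketch, where the caller passes an estimate (prompt tokens plus the max_tokens budget) because the true output length isn't known until after the call:

```python
class TokenBucket:
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens per second

    def refill(self, elapsed_seconds: float):
        self.available = min(self.capacity,
                             self.available + elapsed_seconds * self.refill_rate)

    def try_consume(self, estimated_tokens: int) -> bool:
        if estimated_tokens <= self.available:
            self.available -= estimated_tokens
            return True
        return False  # caller should 429 or queue the request

bucket = TokenBucket(tokens_per_minute=60_000)
assert bucket.try_consume(4_000)       # fits
assert not bucket.try_consume(58_000)  # only 56,000 tokens left
bucket.refill(elapsed_seconds=4)       # 4 seconds restores 4,000 tokens
assert bucket.try_consume(58_000)      # now it fits
```

The same shape works for the other layers by changing what the bucket counts: requests for the gateway layer, dollars for the budget layer, with one bucket per tenant for quota isolation.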
The AI gateway pattern is the elegant solution here.
Instead of managing rate limits separately across each provider, you put a unified gateway in front of all your LLM traffic. The gateway enforces token and request limits uniformly, handles routing policies, manages token quotas, and provides observability from a single layer. LiteLLM is the open-source implementation a lot of teams use — you can run it as a proxy that automatically splits traffic across multiple API keys, rotating when one gets rate-limited, with per-model RPM and TPM limits and automatic fallback routing.
And fallback routing deserves a mention because it's more than just redundancy.
It's capacity management. If your primary model approaches its token quota or crosses a latency threshold, traffic automatically routes to a fallback model. Requests targeting GPT-4 can shift to another provider when capacity is exhausted, maintaining availability while staying within global rate-limit policies. The monitoring signals you're watching: HTTP 429 response counts, token spend per tenant, quota utilization percentages. A sudden spike in token spend from a single tenant is a red flag — it usually means a misconfigured agent loop or a runaway prompt cycle.
The runaway loop is the nightmare scenario. An agent that's stuck retrying something that will never succeed, burning tokens at full speed.
And it will happen. Every production agent system eventually has a runaway loop incident. The question is whether you have the monitoring to catch it in seconds or minutes, and whether your rate limits actually terminate it before it costs you thousands of dollars.
Let's close with the human oversight question, because I think this is where the stakes get real. The fifty thousand dollar refund scenario.
From Galileo's production guide. Your customer service agent approves a fifty thousand dollar refund to a fraudulent account. Your CFO wants to know how an AI system made a financial decision of that magnitude without oversight. That's not a hypothetical — that's the scenario that's driving the human-in-the-loop conversation at every enterprise that's serious about deploying agents.
And the Gartner cancellation prediction connects here. Forty percent of projects canceled by twenty twenty-seven — inadequate risk controls is one of the three reasons cited.
The distinction that matters is human-in-the-loop versus human-on-the-loop. In-the-loop means humans participate at every critical decision point — the AI acts as an advisor and no decision executes without human validation. On-the-loop means the AI operates autonomously, but humans monitor via dashboards, alerts, or sampling audits and can intervene. Most mature production systems use a mix of both depending on the decision type.
And there's a calibration question about what percentage of decisions should escalate.
The target escalation rate that Galileo cites is ten to fifteen percent — meaning eighty-five to ninety percent of decisions execute autonomously while critical cases get human review. The warning sign is when escalation rates hit sixty percent. At that point, the system is miscalibrated — you've essentially built a very expensive human routing layer, not an autonomous agent.
The confidence threshold varies dramatically by domain.
Financial services: ninety to ninety-five percent confidence threshold required before autonomous execution, due to regulatory scrutiny. Customer service for routine inquiries: eighty to eighty-five percent. Healthcare: ninety-five percent plus, given patient safety implications. These aren't arbitrary — they map to the cost of a wrong decision and the regulatory environment.
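Wired together, the threshold routing looks like this. The per-domain values are the ones just quoted; the confidence input is assumed to be a calibrated score, which is exactly the caveat coming up next.

```python
# Thresholds from the discussion above (midpoints of the quoted ranges):
THRESHOLDS = {
    "financial_services": 0.92,  # quoted range: 90-95%
    "customer_service": 0.82,    # quoted range: 80-85%
    "healthcare": 0.95,          # quoted floor: 95%+
}

def route(domain: str, confidence: float) -> str:
    # Unknown domains default to the strictest threshold.
    threshold = THRESHOLDS.get(domain, 0.95)
    return "autonomous" if confidence >= threshold else "escalate_to_human"

assert route("customer_service", 0.88) == "autonomous"
assert route("financial_services", 0.88) == "escalate_to_human"
assert route("unknown_domain", 0.90) == "escalate_to_human"
```

Tracking the fraction of `escalate_to_human` outcomes over time gives you the escalation-rate metric directly: ten to fifteen percent is healthy, sixty percent means the calibration is broken.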
On the regulatory side, the EU AI Act is explicit.
For high-risk AI systems, natural persons must be able to oversee operation, human operators need authority to intervene in critical decisions, systems must enable independent review of AI recommendations, and override mechanisms must function without technical barriers. GDPR, SEC, CFPB, FDA guidance — all of them require meaningful human intervention in decision-making. An agent making credit decisions, insurance underwriting, or employment recommendations without human review mechanisms is a legal liability, not just an operational risk.
The synchronous versus asynchronous oversight split is the practical design question.
Synchronous approval pauses agent execution pending human authorization. Latency cost: zero point five to two seconds per decision. Use it for financial transactions above a threshold, account modifications, data deletion, anything that can't be easily reversed. Asynchronous audit lets agents execute autonomously while logging decisions for later human review — near-zero latency. Use it for content classification, recommendation systems, internal processes where you can correct mistakes retroactively.
There's a technical problem with confidence-based escalation that's easy to miss.
Neural networks are systematically overconfident. They produce high confidence scores even for incorrect predictions, which breaks threshold-based escalation strategies. Three defenses: temperature scaling, which is a post-hoc adjustment using validation sets to recalibrate probability distributions. Ensemble disagreement, where high disagreement across ensemble members triggers escalation regardless of individual confidence scores. And conformal prediction, which provides statistical guarantees through prediction sets with coverage probabilities. That last one is getting more traction in regulated industries specifically because it can provide formal coverage guarantees.
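Temperature scaling is the simplest of the three to show: divide the logits by a learned temperature T before the softmax. T is fit on a held-out validation set; the value below is illustrative. Note that predictions themselves don't change, only the confidence attached to them.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibrated_probs(logits, temperature: float):
    # T > 1 softens an overconfident distribution; T = 1 is a no-op.
    return softmax([x / temperature for x in logits])

logits = [4.0, 1.0, 0.5]
raw = softmax(logits)
cal = calibrated_probs(logits, temperature=2.0)  # T fit offline; 2.0 illustrative

assert raw.index(max(raw)) == cal.index(max(cal))  # argmax unchanged
assert max(cal) < max(raw)                         # confidence reduced
```

The practical consequence for escalation: a raw score that clears the ninety percent threshold might calibrate down below it, which is precisely the case you want routed to a human.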
And then the feedback loop — human corrections need to actually improve the system.
This is the part that teams often skip because it requires infrastructure they don't have yet. When a human reviewer corrects an agent decision, that correction needs to be captured in a structured way — not just "this was wrong" but what the correct answer was and why, categorized in a way that enables pattern analysis across similar scenarios. And then there needs to be an automated integration pipeline feeding those corrections into retraining or fine-tuning workflows. If human corrections just fix individual errors without improving the system, you've built a very expensive manual review process, not a learning system.
So let's pull this together into something actionable. If you're building an agent today, what's the order of operations?
Observability first, before you ship anything. If you can't see what your agent is doing step by step, you cannot debug it in production. Enable tracing from day one — LangSmith if you're on LangChain, LangWatch if you want OpenTelemetry-native, Helicone if you just need cost tracking with zero code changes. Second, start your golden dataset before you ship. Every test case you add after a production failure is a case you wish you'd had before. Third, implement the immutability principle for prompts — a prompt versioning system doesn't have to be sophisticated, but it has to exist. The three-word prompt incident is a when-not-if.
On the rollout side, the fifteen-minute rollback test is a concrete thing you can apply right now. Can you revert a prompt change in under fifteen minutes? If not, that's the thing to fix before you scale.
And on human oversight, the decision isn't whether to have it — for anything consequential, you have to. The decision is where to put it. Synchronous approval for irreversible high-stakes actions. Asynchronous audit for reversible lower-stakes ones. And measure your escalation rate — if it's above thirty percent, something is wrong with your confidence calibration, not just your thresholds.
The thing that ties all of this together is that the demo is a single happy path. Production is the entire distribution of inputs, edge cases, adversarial users, provider outages, model drift, and budget surprises. The tooling exists to handle all of it — but you have to actually use it.
The forty percent cancellation rate is almost certainly going to materialize in organizations that treated the demo as proof the production problem was solved. It wasn't. The demo proved the model can do the task. Production proves you can run it reliably, at cost, with oversight, at scale. Those are completely different problems.
Alright, that's a wrap on this one. Big thanks to our producer Hilbert Flumingtop for keeping the ship running. And thanks to Modal for providing the GPU credits that power this show — if you're building anything that touches serverless GPU infrastructure, they're worth a look. This has been My Weird Prompts. If you haven't followed us on Spotify yet, search for My Weird Prompts and hit follow so you don't miss new episodes when they drop. Until next time.
See you then.