So Daniel sent us this one, and it's squarely in the weeds of agentic AI development. The question is essentially: why do standard benchmarks fall apart when you're building bespoke agentic systems, and what does it actually look like to build your own evaluation framework from scratch? He wants to walk through the whole lifecycle, creating criteria, running model comparisons, automating scoring, and then keeping the thing alive as your requirements shift. The tension at the heart of it is counterintuitive: in a world with more benchmarks than ever, the right move for serious workloads is often to throw them all out and start over.
And that tension is real. I've been reading through some of the recent evals literature and the honest practitioners, the ones actually shipping agentic systems, almost uniformly say the same thing: public benchmarks told them almost nothing useful about production behavior.
Which is a pretty damning statement given how much weight gets placed on leaderboard scores.
It is. And it makes sense once you understand what public benchmarks are actually optimized for. They're optimized for comparability across models, across research groups, across time. That's a legitimate goal. But comparability requires standardization, and standardization means the tasks have to be generic enough that any model can attempt them. The moment you have a specific domain, a specific tool ecosystem, a specific failure profile you care about, generic tasks stop predicting anything meaningful.
So the benchmark is measuring something real, just not the thing you need measured.
Precisely that. And for agentic systems the gap is even wider than for, say, a text classification model. Because agentic systems aren't just producing outputs, they're taking sequences of actions. They're calling tools, deciding when to stop, deciding when to ask for clarification, deciding when they're uncertain enough that they should escalate rather than proceed. None of the standard benchmarks are testing that decision architecture in any realistic way.
Let me push on the word "agentic" for a second, because it gets used to mean everything from a simple tool-calling wrapper to a fully autonomous multi-step reasoning system. Does the argument for custom benchmarks apply across that whole spectrum, or is it really about the more complex end?
Good push. I'd say the argument intensifies as autonomy increases, but it applies even at the simpler end. A system that does nothing but retrieve documents and summarize them is technically not very agentic, but if you're deploying it for, say, legal due diligence, the failure profile is highly domain-specific. A generic summarization benchmark is not going to tell you whether the model hallucinates citations or misrepresents the scope of a contractual clause. So even there, you need custom evals.
The stakes just get higher as autonomy increases.
Much higher. When a model is making multi-step decisions and each step feeds into the next, errors compound. A small systematic bias in step one can produce wildly wrong outputs by step five. And the interesting thing is that compounding error behavior is almost invisible in single-turn benchmarks. You'd never catch it.
By the way, today's episode is being written by Claude Sonnet four point six. Just worth noting.
Our friends down the road. Appreciated. Okay, so let's actually get into the construction. If I'm a developer or a team lead who's convinced that I need a custom benchmark, where does this actually start? Because I think a lot of people know they should do it and then just... don't.
They do the thing where they convince themselves the public benchmarks are probably fine enough.
And ship something that falls apart in production three weeks later. So step one is evaluation criteria, and this is the hardest part. Harder than the tooling, harder than the scoring. Because you have to answer a question that sounds simple: what does good look like for this specific system?
Which requires you to actually know what the system is for.
Which sounds obvious but often isn't. I've seen teams who could describe their system's technical architecture in great detail but couldn't articulate what a successful output meant beyond "the user is happy." That's not evaluable. You need to decompose it.
How do you decompose it?
You start with the workload. What are the actual tasks this system performs? Not in the abstract, but concretely. If it's a customer support agent, is it handling billing disputes? Technical troubleshooting? Account changes? Each of those has different success criteria. A billing dispute resolution might be: did the agent correctly identify the disputed amount, did it apply the right policy, did it resolve without escalation when the issue was within policy bounds. That's three separate evaluable dimensions right there.
And you're saying you write those down before you run a single model.
Before you run a single model. This is the discipline that most teams skip. They reach for a model, they start prompting, they eyeball the outputs, and they develop a vague intuition for whether it's working. That intuition is real but it's not transferable, it's not reproducible, and it doesn't tell you anything about the system's behavior at the tail of the distribution.
The tail being where all the interesting failures live.
Always. The median case usually works fine. It's in the edge cases, the ambiguous inputs, the adversarial inputs, the inputs just slightly outside the training distribution, that you discover whether you actually have a system you can trust.
So you've got your workload decomposed, you've got dimensions. What does a well-formed evaluation criterion actually look like? Give me a concrete example.
Okay. Let's say you're building an agentic system for medical billing code selection. Not prescribing, just coding. The system reads a clinical note and selects the appropriate diagnostic codes. A poorly formed criterion is: "selects the correct code." A well-formed criterion has several components. First, the input specification: what kinds of notes, from what specialties, with what levels of completeness. Second, the output specification: primary code, secondary codes, confidence indication. Third, the success condition: exact match on primary code, or acceptable match within the same category, and you have to decide that. Fourth, the failure taxonomy: wrong category is worse than wrong specificity within the right category. Those are different failure modes with different downstream consequences.
And that failure taxonomy is doing a lot of work. You're not just counting right and wrong, you're weighting them.
Weighting by consequence. Which is the only weighting that matters for a deployed system. A code that's wrong by one specificity level might generate a billing adjustment. A code that's wrong by category might trigger a fraud audit. Those are not equivalent errors and a flat accuracy score treats them as equivalent.
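To make the weighting idea concrete, here's a minimal Python sketch of consequence-weighted scoring against a failure taxonomy, using the medical-coding example from the discussion. The specific codes, categories, and weight values are hypothetical placeholders; in practice a domain expert sets them.

```python
# Higher weight = worse downstream consequence.
FAILURE_WEIGHTS = {
    "correct": 0.0,
    "wrong_specificity": 1.0,  # right category, wrong specificity: billing adjustment
    "wrong_category": 5.0,     # wrong category entirely: possible fraud audit
}

def classify(predicted: str, expected: str) -> str:
    """Map a prediction to a failure class. Assumes codes like 'E11.9',
    where the text before the dot is the category."""
    if predicted == expected:
        return "correct"
    if predicted.split(".")[0] == expected.split(".")[0]:
        return "wrong_specificity"
    return "wrong_category"

def weighted_error(results: list[tuple[str, str]]) -> float:
    """Average consequence-weighted error, in contrast to flat accuracy."""
    total = sum(FAILURE_WEIGHTS[classify(p, e)] for p, e in results)
    return total / len(results)

results = [("E11.9", "E11.9"), ("E11.65", "E11.9"), ("I10", "E11.9")]
# Flat accuracy treats the last two errors as equivalent; the weighted score does not.
print(weighted_error(results))  # (0 + 1 + 5) / 3 = 2.0
```

Flat accuracy on that sample would report two errors of equal severity; the weighted score makes the category-level mistake dominate.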
I want to flag something here because I think this is where a lot of teams get stuck. Building that failure taxonomy requires domain expertise that the ML team often doesn't have. Someone has to know that a category-level billing error is worse than a specificity-level error. That's a medical billing expert, not an engineer.
This is a really important structural point. The best custom benchmarks I've seen described in the literature are built collaboratively between domain experts and the technical team. The domain experts define the failure taxonomy and the consequence weights. The technical team translates those into evaluable, automatable criteria. Neither group can do it alone.
And neither group always wants to do it together, in my experience.
Ha. No. There's friction. Domain experts often don't think in terms of systematic evaluation. They think in terms of specific cases they've seen go wrong. Which is actually useful, those cases become your hard test set. But you have to structure it.
Okay, so you've got your criteria. You've got your failure taxonomy, your consequence weights, your input specifications. Now you're running evaluations with different models. What does that process actually look like?
So this is step two, and there are a few things that separate rigorous model comparison from the kind of informal "let's try GPT and Claude and see which feels better" approach that a lot of teams default to.
The vibe-based evaluation.
Vibe-based evaluation, yes, which has an embarrassingly large influence on production decisions. The first discipline is holding your test set fixed. You construct your evaluation dataset before you start comparing models, and you don't change it based on how models perform. This sounds obvious and it is constantly violated. Teams will run a model, see it fail on certain cases, decide those cases were "unfair," remove them, and then declare the model passes. That's benchmark contamination by another name.
You've just optimized for the benchmark instead of the behavior.
And you've made the benchmark useless. The second discipline is separating your development set from your evaluation set. You use the development set to tune prompts, to adjust the system architecture, to iterate. The evaluation set is held out and you only run against it when you're making an actual decision. If you tune against your evaluation set, you're overfitting to it and you won't know until production.
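One simple way to enforce that split discipline is to derive each example's assignment from a hash of its ID, so membership never changes as the dataset grows or gets reshuffled. A sketch, with the ID scheme assumed:

```python
import hashlib

def split_of(example_id: str, eval_fraction: float = 0.3) -> str:
    """Assign an example to 'dev' or 'eval' deterministically from its ID.
    Hash-based assignment means examples can never silently migrate between
    splits, and adding new examples never disturbs existing assignments."""
    digest = hashlib.sha256(example_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "eval" if bucket < eval_fraction else "dev"

print(split_of("case-001"))  # always the same answer for the same ID
```

The payoff is that "which split is this case in" is a pure function of the case itself, not of when or how it entered the dataset.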
What's the right ratio there? Development to evaluation?
It depends heavily on how rare your hard cases are. For a system where the interesting failures are rare events, you might need a much larger evaluation set than you'd initially think. I've seen guidance suggesting that if your target failure rate is one percent, you need at least a few hundred evaluation examples just to detect a statistically meaningful difference between two models. If you're comparing models on a fifty-example set, most of what you're seeing is noise.
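You can sanity-check that intuition with a one-line binomial calculation: the probability that a system with a given true failure rate shows zero failures across n independent evaluation examples.

```python
def p_zero_failures(failure_rate: float, n_examples: int) -> float:
    """Probability that a system with the given true failure rate produces
    zero observed failures across n independent evaluation examples."""
    return (1.0 - failure_rate) ** n_examples

# With a true 1% failure rate and only 50 examples, you observe zero
# failures more than 60% of the time: the set cannot resolve the rate.
print(round(p_zero_failures(0.01, 50), 3))   # 0.605
print(round(p_zero_failures(0.01, 500), 3))  # 0.007
```

In other words, a fifty-example set will usually look flawless even when one in a hundred production runs fails.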
That's a sobering number for teams that think they've done rigorous evaluation because they ran thirty examples.
Thirty examples tells you approximately nothing about tail behavior. It tells you about the center of the distribution and that's it. Now, for agentic systems specifically, there's another layer of complexity: you're not just evaluating outputs, you're evaluating trajectories. The sequence of tool calls, the intermediate reasoning steps, the points where the agent decides to ask for clarification versus proceeding. Those intermediate behaviors matter enormously for reliability and they're essentially invisible if you only score final outputs.
How do you evaluate a trajectory? That feels like it requires a lot of instrumentation.
It does require instrumentation, but it's tractable. You log every step, every tool call, every intermediate state. Then you define trajectory-level criteria: did the agent attempt the right tools in a reasonable order, did it handle a tool error gracefully, did it avoid redundant calls that would inflate latency and cost. Some of those you can score automatically, some you need human review for, at least initially.
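Here's a minimal sketch of what automatic trajectory-level checks might look like over a logged run. The log format and criteria names are hypothetical; real logs would carry far more state per step.

```python
trajectory = [
    {"tool": "search_docs", "status": "ok"},
    {"tool": "fetch_record", "status": "error"},
    {"tool": "fetch_record", "status": "ok"},  # retried after the error: good
    {"tool": "search_docs", "status": "ok"},   # repeats an already-successful call
]

def trajectory_checks(steps):
    """Score trajectory-level criteria: error recovery and redundant calls."""
    succeeded = set()
    redundant = 0
    recovered = False
    prev = None
    for s in steps:
        tool = s["tool"]
        # Repeating a call that already succeeded inflates latency and cost.
        if tool in succeeded:
            redundant += 1
        # Retrying the same tool successfully right after an error counts as recovery.
        if prev and prev["status"] == "error" and prev["tool"] == tool and s["status"] == "ok":
            recovered = True
        if s["status"] == "ok":
            succeeded.add(tool)
        prev = s
    return {
        "recovered_from_error": recovered,
        "redundant_calls": redundant,
        "total_calls": len(steps),
    }

print(trajectory_checks(trajectory))
```

Checks like these run automatically over every logged run; the harder judgment calls (was this the *right* tool order for this task) start with human review.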
The cost dimension is interesting. Most benchmark discussions are purely about accuracy. But for production agentic systems, cost per task is a real constraint.
Critical constraint. And it's something you should be tracking in your custom benchmark from day one. A model that achieves ninety-two percent task success but costs four times as much per run as a model achieving eighty-eight percent might not be the right choice depending on your workload volume and your error tolerance. You need both numbers to make that decision.
So your benchmark output isn't a single score. It's a vector.
It's a vector, and ideally you're plotting it in a way that makes the tradeoffs visible. Accuracy versus cost. Accuracy versus latency. Accuracy on common cases versus accuracy on edge cases. Different deployment contexts will weight those differently. A low-volume, high-stakes system might accept high cost for high accuracy. A high-volume, lower-stakes system might optimize heavily for cost.
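A small sketch of the vector idea: benchmark output as a structured record rather than a scalar, with the tradeoff made explicit by a context-dependent weighting. All field names and numbers here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """Benchmark output as a vector, not a single score."""
    task_success: float       # fraction of tasks completed correctly
    edge_case_success: float  # success on the hard/tail subset
    cost_per_task: float      # dollars per run
    p95_latency_s: float      # seconds

model_a = BenchmarkResult(task_success=0.92, edge_case_success=0.71,
                          cost_per_task=0.48, p95_latency_s=14.0)
model_b = BenchmarkResult(task_success=0.88, edge_case_success=0.69,
                          cost_per_task=0.12, p95_latency_s=6.5)

def utility(r: BenchmarkResult, w_success: float = 1.0, w_cost: float = 0.5) -> float:
    """Context-dependent weighting: which model 'wins' depends on the
    deployment, not on a single leaderboard number."""
    return w_success * r.task_success - w_cost * r.cost_per_task

# Model A is more accurate, but under a cost-sensitive weighting B wins.
print(utility(model_a), utility(model_b))
```

Change the weights and the ranking flips, which is exactly the point: the decision lives in the deployment context, not in the scores alone.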
Let's get into automated scoring, because this is where the practical rubber meets the road. You've got a test set, you've run your models, you've logged trajectories. How do you actually score at scale without having a human review every output?
This is where LLM-as-judge has become useful, and also where it introduces its own set of problems that you have to manage carefully. The basic idea is that you use a separate language model to evaluate the outputs of your agent. You give the judge model the input, the expected output or a rubric, the agent's output, and you ask it to score along your defined dimensions.
Which creates an obvious question: who evaluates the evaluator?
And this is not a trivial question. LLM judges have known biases. They tend to favor longer outputs over shorter ones even when the shorter output is more correct. They favor outputs that sound confident. They have positional biases, meaning they rate the first option in a comparison higher more often than chance would predict. If you're not controlling for these, your automated scoring is systematically distorted.
How do you control for them?
A few practices. For positional bias, you run each comparison twice with the order swapped and average the scores. For length bias, you include explicit instructions in your judge prompt to evaluate accuracy and relevance independent of length, and you validate that by spot-checking cases where the shorter output was clearly better. For confidence bias, similar explicit instructions plus validation. And critically, you maintain a human-labeled calibration set. You take a hundred or so examples, have domain experts label them, and periodically check that your automated judge agrees with human judgment at a rate you consider acceptable.
What rate is acceptable?
Depends on the stakes. For a low-stakes application, eighty-five percent agreement with human labels might be fine. For something like medical coding or legal document analysis, I'd want ninety-five percent or better before I trusted the automated scores to drive production decisions. And you should be reporting that calibration number alongside your benchmark results. A benchmark with ninety-eight percent automated scoring agreement means something very different from one with seventy-nine percent.
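Two of those practices are simple enough to sketch directly: order-swapped pairwise judging to cancel positional bias, and the agreement rate against a human-labeled calibration set. The `judge_fn` here is a hypothetical callable standing in for a real LLM API call.

```python
def judge_pairwise(judge_fn, task_input, out_a, out_b):
    """Control for positional bias: score the comparison twice with the
    order swapped and average. `judge_fn(task, first, second)` is assumed
    to return a score in [0, 1] for how strongly it prefers `first`."""
    score_a_first = judge_fn(task_input, out_a, out_b)
    score_b_first = 1.0 - judge_fn(task_input, out_b, out_a)
    return (score_a_first + score_b_first) / 2.0  # bias-adjusted preference for A

def judge_agreement(judge_labels, human_labels):
    """Agreement between the automated judge and the human-labeled
    calibration set. Report this alongside benchmark scores."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# A judge with pure positional bias (always scores whichever output is
# shown first at 0.6) washes out to a neutral 0.5 after the swap:
print(judge_pairwise(lambda task, first, second: 0.6, "q", "A", "B"))  # 0.5

print(judge_agreement(["pass", "pass", "fail", "pass", "fail"],
                      ["pass", "fail", "fail", "pass", "fail"]))       # 0.8
```

The 0.8 agreement in that toy sample would be acceptable for low-stakes use and well short of the bar for medical coding or legal analysis.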
There's also the question of what model you use as the judge. Using the same model to evaluate itself is obviously problematic.
Obviously problematic and surprisingly common. You want your judge to be a different model from the one you're evaluating, and ideally a model with known strengths in the domain you're evaluating. For technical accuracy, a model with strong coding or reasoning capabilities. For language quality, a model with strong language generation. It's not always possible to have the perfect judge, but you should at least avoid the obvious conflict of interest.
I want to step back for a second and name something that I think is underappreciated in this whole discussion. Building a good custom benchmark is itself a significant engineering and organizational investment. You're talking about domain expert time, engineering time for instrumentation and tooling, ongoing calibration. For a lot of teams, especially smaller ones, this feels like it's competing with shipping the actual product.
It's a real tension and I don't want to be glib about it. But I think the framing of "benchmark investment versus shipping" is actually backwards. The teams I've seen skip rigorous evaluation and ship fast have, with remarkable consistency, spent more total time on debugging production failures, rebuilding trust with users, and doing emergency patches than they would have spent on proper evaluation upfront. The benchmark cost is a known, bounded investment. The cost of discovering your system has a systematic error in production is unknown and often very large.
The classic "pay now or pay later" situation, except the later payment comes with interest and embarrassment.
Significant interest. And for high-value workloads specifically, the equation is even clearer. If your agentic system is making decisions that affect revenue, compliance, patient outcomes, legal exposure, the cost of a systematic evaluation failure isn't just technical debt. It's actual liability.
Okay, let's talk about the part that I think is underserved in most discussions of benchmarking: maintenance. You've built this thing, it's running, it's giving you good signal. How do you keep it from going stale?
This is the part that separates teams that actually use their benchmarks from teams that built one, were proud of it for a month, and then quietly stopped trusting it. Benchmarks go stale in a few distinct ways and you need to be watching for each of them.
Walk me through them.
The first is requirements drift. Your product evolves. New features get added, the scope of the system expands, user behavior reveals use cases you didn't anticipate. If your benchmark doesn't track those changes, you're evaluating against a spec that no longer exists. The fix is treating benchmark updates as a formal part of your product development process. When a new requirement is added to the system, a corresponding evaluation criterion gets added to the benchmark. That has to be a policy, not an aspiration.
It has to be someone's job.
Someone's job, or at minimum someone's explicit responsibility. The second way benchmarks go stale is data drift in the test set itself. This is more subtle. The inputs your system sees in production change over time. User language evolves. The documents being processed change in character. New edge cases emerge that weren't in your original test set. If you're not periodically refreshing your test examples with real production samples, your benchmark is measuring performance on a distribution that no longer matches your actual workload.
How do you sample from production without contaminating your evaluation set? Because presumably you don't want to expose real user data to the evaluation pipeline.
Great question. You need a process for anonymizing and curating production samples before they enter the evaluation set. Specifically, you sample cases that the system handled with low confidence, cases that triggered human review, cases that generated user complaints or corrections. Those are your highest-signal additions to the test set because they represent the system's current frontier of difficulty. You anonymize, you have domain experts label them, and you add them to the rotation.
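The selection step of that feedback loop is easy to sketch. The record fields and the anonymization are placeholders; real PII handling is domain-specific and needs expert review.

```python
def select_candidates(production_logs, confidence_threshold=0.6):
    """Pick the highest-signal production cases for the evaluation set:
    low-confidence runs, runs escalated to a human, and runs the user
    corrected. These sit at the system's current frontier of difficulty."""
    return [
        rec for rec in production_logs
        if rec["confidence"] < confidence_threshold
        or rec.get("escalated")
        or rec.get("user_corrected")
    ]

def anonymize(record):
    """Placeholder: strip identifying fields before the record enters the
    eval pipeline. Real anonymization must be reviewed, not just filtered."""
    return {k: v for k, v in record.items() if k not in ("user_id", "raw_text")}

logs = [
    {"confidence": 0.95, "user_id": "u1"},                     # routine: skip
    {"confidence": 0.41, "user_id": "u2"},                     # low confidence: keep
    {"confidence": 0.90, "user_id": "u3", "escalated": True},  # escalated: keep
]
candidates = [anonymize(r) for r in select_candidates(logs)]
print(len(candidates))  # 2
```

After selection, the cases still go to domain experts for labeling before they enter the rotation; the filter only decides what's worth their time.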
That's actually a really elegant feedback loop. The system's own uncertainty flags the cases that most need to be in the test set.
And it means your benchmark continuously adapts to reflect the actual hard cases rather than the hard cases you imagined when you first built it. The third staleness vector is model improvement. Models get updated. The model you're using in production in January might be a meaningfully different model in July. If you're comparing against a baseline from six months ago, you might be comparing against a version of the model that no longer exists.
Which means your benchmark results aren't comparable over time.
Not without careful versioning. You need to log which model version was evaluated, with what system prompt, with what tool configuration. Because even a minor prompt change can shift performance significantly on specific dimensions. Reproducibility requires that level of specificity. If you can't reproduce a benchmark run from three months ago, you've lost the ability to meaningfully track improvement.
This is making me think about something. You've described the benchmark as a living document, essentially. It gets updated as requirements change, as new production cases come in, as model versions change. But that creates a tension with comparability. If the benchmark is changing, how do you know whether an improvement in your score represents the system actually getting better or just the benchmark getting easier?
This is a hard problem and I'll be honest that there's no fully satisfying solution. The best practice I've seen is to maintain a frozen core. A subset of your evaluation set, maybe twenty to thirty percent, that you never modify. It represents a stable set of foundational requirements that should always be met. You add to the benchmark freely, you update the peripheral cases, but the frozen core stays fixed. That gives you a stable longitudinal signal even as the broader benchmark evolves.
The frozen core as your historical anchor.
Your historical anchor. And you report scores on the frozen core and the full benchmark separately. If your frozen core score is going up and your full benchmark score is going up, you're improving. If your frozen core score is flat but your full benchmark score is going up, you might just be expanding into easier territory. If your frozen core score is going down while your full benchmark score is going up, you have a regression problem and you need to investigate immediately.
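That three-way reading of the two scores can be written down as a tiny decision rule. The epsilon threshold is an arbitrary illustrative value; in practice it should reflect your measurement noise.

```python
def interpret_trend(core_delta: float, full_delta: float, eps: float = 0.005):
    """Classify a change in (frozen-core score, full-benchmark score)
    between two benchmark runs."""
    if core_delta > eps and full_delta > eps:
        return "genuine improvement"
    if abs(core_delta) <= eps and full_delta > eps:
        return "possibly expanding into easier territory"
    if core_delta < -eps:
        return "regression on core requirements: investigate"
    return "no significant change"

print(interpret_trend(0.02, 0.03))    # genuine improvement
print(interpret_trend(0.001, 0.04))   # possibly expanding into easier territory
print(interpret_trend(-0.03, 0.02))   # regression on core requirements: investigate
```

The rule only works because the two numbers are reported separately; collapse them into one score and the regression case becomes invisible.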
That decomposition is really useful. It's essentially a way of distinguishing genuine capability improvement from benchmark gaming.
Which is a problem even in your own internal benchmark if you're not careful. The incentive to show progress can subtly influence which cases get added to the test set. The frozen core removes that lever. Let me also mention one more maintenance practice that I think is underutilized: adversarial example injection. Periodically, deliberately, add examples that are designed to catch specific failure patterns you're worried about. Not just edge cases from production, but synthetic cases that probe known weaknesses. If you're worried about the system being overconfident on ambiguous inputs, write cases that are specifically designed to be ambiguous and verify that the system flags uncertainty appropriately.
That's essentially red-teaming your own benchmark.
Red-teaming your benchmark, yes. And it keeps the benchmark honest. A benchmark that only contains cases the system has seen before, or cases drawn from the distribution the system was designed for, will systematically underreport the system's brittleness.
I want to zoom out for a second because I think there's a broader epistemological point here that's worth naming. What we're really talking about is the relationship between measurement and trust. A custom benchmark, done well, is a claim: "we know what this system does, we know where it works and where it doesn't, and we have evidence for both." That's a very different posture than "we ran it on some benchmarks and it scored well."
The posture of epistemic accountability. And I think for agentic systems specifically, that posture is going to become a regulatory and contractual expectation, not just a best practice. If you're deploying an agentic system in a regulated domain, the question of "how do you know it works" is going to need a better answer than "it scored well on public benchmarks."
We're already seeing that in some verticals. Medical device software has had rigorous performance validation requirements for years. Financial model risk management frameworks require documented backtesting and ongoing monitoring. The agentic AI space is going to converge on similar expectations.
And the teams that have already built rigorous custom evaluation frameworks are going to be in a much better position when those expectations arrive. Not just because they'll have the documentation, but because they'll have developed the organizational muscle for systematic evaluation. That's not something you can stand up quickly when a regulator asks for it.
Let's talk practical takeaways, because I want listeners to be able to do something with this. If you're a developer or a team lead who's building an agentic system and you've been doing informal evaluation, what are the concrete first steps?
I'd say there are three things you can do this week. The first is to write down your failure taxonomy. Not your success criteria, your failure taxonomy. What are the ways this system can fail, and what are the consequences of each failure type? Sit down with someone who has domain expertise and spend two hours on this. You will learn things about your own system that you didn't know. That document becomes the foundation of your evaluation criteria.
And it's useful even before you formalize anything. Just having that conversation surfaces assumptions that have been floating around unexamined.
The second thing is to pull your last fifty production outputs that triggered some kind of issue: a user complaint, a correction, a manual review, an escalation. Label them according to your failure taxonomy. You now have a starter test set that's grounded in actual production behavior, not hypothetical cases. That's more valuable than a hundred synthetic examples.
Fifty real failures beat a hundred imagined ones.
Every time. The third thing is to set up basic trajectory logging if you haven't already. Every tool call, every intermediate state, every confidence signal the system produces. You don't need to analyze it immediately. But you can't analyze what you haven't captured. The logging infrastructure is the prerequisite for everything else, and it's much harder to retrofit than to build in from the start.
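The logging prerequisite can start as small as an append-only JSON Lines sink, one record per agent step. The field names are illustrative; `StringIO` stands in for a real file or log service.

```python
import io
import json
import time

def log_step(logfile, run_id, step):
    """Append one agent step (tool call, intermediate state, confidence
    signal) as a JSON line. Capture everything now, analyze later."""
    record = {"run_id": run_id, "ts": time.time(), **step}
    logfile.write(json.dumps(record) + "\n")

# In production this would be a file or a log sink; StringIO for illustration.
buf = io.StringIO()
log_step(buf, "run-42", {"tool": "search_docs", "status": "ok", "confidence": 0.83})
log_step(buf, "run-42", {"tool": "fetch_record", "status": "error"})
print(buf.getvalue().count("\n"))  # 2 steps logged
```

JSON Lines is a deliberately boring choice: every downstream tool can read it, and an append-only record per step is exactly the shape trajectory-level evaluation needs later.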
What about teams that are just starting, haven't deployed yet? Is there a version of this for greenfield development?
For greenfield, the most valuable thing you can do is involve domain experts in defining evaluation criteria before you write your first prompt. Not after you have a prototype you want to validate, but before. The discipline of writing down what good looks like forces clarity about what you're actually building. I've seen teams that went through that process discover that their initial system design was solving the wrong problem. Better to discover that at week one than at week twelve.
The benchmark as a design tool, not just a measurement tool.
That reframing is important. If your evaluation criteria are clear, they constrain and guide your system design. They tell you which capabilities are load-bearing and which are nice-to-have. They tell you where you need to invest in reliability and where good-enough is actually good enough. A well-designed benchmark isn't just measuring your system, it's shaping it.
There's something almost philosophical about that. The act of defining what you're measuring changes what you build.
And it changes what you pay attention to during development. When you have explicit trajectory criteria, you start noticing when your agent is taking inefficient paths through a task. When you have explicit uncertainty criteria, you start noticing when it's proceeding confidently on inputs it should be flagging. The benchmark makes the invisible visible.
I think we've covered the arc pretty thoroughly. The case against public benchmarks for bespoke agentic systems. Building evaluation criteria with the right people in the room. Running rigorous model comparisons with proper test set discipline. Automated scoring with calibrated judges. And the maintenance practices that keep the whole thing useful over time.
One thing I'd leave listeners with, and this is maybe the meta-point: the goal of all of this is not to have a benchmark. The goal is to have justified confidence in your system's behavior. The benchmark is just the instrument. And like any instrument, its value depends entirely on whether it's measuring what you actually care about. A well-calibrated thermometer tells you the temperature. A well-calibrated benchmark tells you whether your system is doing what you need it to do in the conditions you'll actually deploy it. That's the only thing it's for.
And if it's not doing that, it's just a number that makes you feel better about a decision you've already made.
Which is worse than no benchmark at all, because it's false confidence.
On that cheerful note. Big thanks to Hilbert Flumingtop for producing the show, as always. And thank you to Modal for keeping our compute pipeline running. If you're spinning up evaluations at scale, serverless GPU infrastructure is not a place to compromise. Check them out. This has been My Weird Prompts. If you want to find all two thousand one hundred and seventy-four episodes, myweirdprompts.com has the full archive. Leave us a review if this one was useful to you.
See you next time.