Here's what Daniel sent us this week. He writes: this podcast is a living experiment in using AI for learning and information, and recent episodes have been covering near real-time events, primarily the Iran-Israel War. These topics are genuinely hard from an AI perspective because language models have knowledge cutoffs and need search tooling to function. The podcast started with Tavily for search, then recently shifted to Exa.ai. When making these pipeline changes, the host runs undisclosed A/B tests, listening to episodes and noting retrieval quality differences by ear. The LLM used in every episode is already announced. Daniel's question is: how do you turn that informal listening practice into a rigorous, reproducible evaluation benchmark? What grading criteria would you use, what tools exist for this, and how do you remove subjectivity from the whole process?
That is a genuinely meaty problem. And I want to say upfront, this is not a niche concern. This is one of the hardest open problems in applied AI right now. How do you evaluate a retrieval-augmented generation pipeline on a topic where the ground truth is changing every few hours?
Right, and the specific context makes it even sharper. The Iran-Israel War is now in its sixth week. Events are moving hourly. Trump confirmed a US naval blockade of Iranian ports on April thirteenth. Iran rejected a ceasefire plan. A US service member is missing after a fighter jet was shot down. Pakistan is now proposing new talks. This is not a topic where you can compare outputs against a static Wikipedia article.
That's what makes it such a perfect stress test, actually. Any LLM without search tooling is completely blind to all of that. The war started after most knowledge cutoffs. So if your search tool fails to surface the naval blockade, or the missing pilot, or the ceasefire rejection, the episode will be factually wrong in ways that are immediately detectable. You don't need a sophisticated eval framework to catch that failure. But to catch the subtler failures, the ones where the episode sounds plausible but is actually missing critical context or has the timeline slightly wrong, that's where formal evaluation becomes essential.
So let's start with the honest diagnosis of what the current method actually is. Listening to an episode and noting whether it sounds good.
The AI evaluation community has a term for this. They call it a vibe check. And RAGAS, which is probably the leading open-source RAG evaluation library, literally defines its mission as helping teams move from vibe checks to systematic evaluation loops. The host is doing exactly what every AI team does before they get serious about evals. It's not a criticism. It's just where everyone starts.
And by the way, today's script is powered by Claude Sonnet four point six, which is itself a fun piece of meta-context given what we're about to discuss.
It really is. We're using an LLM to write a script about how to evaluate LLMs. The recursion is appropriate. But back to the vibe check problem. There are four fundamental issues with the current approach. First, there's no isolation of variables. When the pipeline switches from Tavily to Exa, the LLM, the prompt, the topic, and the date all potentially change simultaneously. You cannot attribute quality differences to the search tool alone. Second, there's no reproducibility. You can't re-run the same episode with the other search tool on the same day with the same topic. Third, topic difficulty confounds everything. An episode covering a ceasefire rejection is inherently harder to cover accurately than an episode about a truce. The search tool may look worse simply because the topic got harder. And fourth, one episode is one data point. You need statistical power to distinguish signal from noise.
That third one is underappreciated, I think. You could switch from Tavily to Exa on a week where the news explodes in complexity, conclude Exa is worse, and be completely wrong.
Completely wrong. And that's a real risk because the Iran war has been escalating in ways that make each week harder than the last. So let's talk about how to actually fix this. The benchmark architecture I'd propose has a few distinct layers. The first is constructing what I'd call a fixed episode test set. You create a dataset of somewhere between twenty and fifty specific factual questions that a well-informed episode on the Iran-Israel War should answer correctly. Things like: what did Trump say about Iranian warships on April thirteenth? What is Pakistan's role in the current negotiations? What is Hezbollah's stated position on Lebanon-Israel talks? These become your ground truth reference outputs.
So instead of asking "does this episode sound good," you're asking "does this episode contain these specific verifiable facts."
And that shift is enormous. Facts are binary. Present or absent. No subjectivity. The moment you anchor your evaluation to specific factual questions, you've removed a huge amount of the noise from the process.
But who writes the ground truth answers? That's still a human judgment call.
It is, and you do it once, carefully, with source citations and timestamps. You write each answer with a reference URL and the timestamp of when that fact was publicly available. That's important because of what I'd call the temporal control problem. The Iran war is producing major developments every few hours. An episode recorded on April fourteenth has different correct answers than one recorded on April seventh. Your benchmark has to timestamp all ground truth facts and only evaluate against facts that were available at the time of recording. It's essentially the knowledge cutoff problem applied to evaluation itself.
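As a rough illustration, here's what one timestamped ground-truth entry and the temporal filter might look like in Python. The field names, dates, and URL are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GroundTruthFact:
    """One verifiable fact a well-informed episode should contain."""
    question: str
    answer: str
    source_url: str
    available_at: datetime  # when the fact became publicly known

def facts_available_at(dataset: list[GroundTruthFact],
                       recording_time: datetime) -> list[GroundTruthFact]:
    """Only evaluate an episode against facts that predate its recording."""
    return [f for f in dataset if f.available_at <= recording_time]

# Illustrative entry; the year and URL are placeholders.
dataset = [
    GroundTruthFact(
        question="What did Trump say about Iranian warships on April 13?",
        answer="He confirmed a US naval blockade of Iranian ports.",
        source_url="https://example.com/blockade-report",
        available_at=datetime(2025, 4, 13, 14, 0, tzinfo=timezone.utc),
    ),
]

recorded = datetime(2025, 4, 14, 9, 0, tzinfo=timezone.utc)
eligible = facts_available_at(dataset, recorded)
```

The filter is what keeps an episode recorded on April seventh from being penalized for missing facts that only emerged on the thirteenth.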
Which is a beautiful piece of irony. The same problem that makes search tooling necessary for the podcast also makes evaluation of that search tooling hard.
That's the core tension in this whole domain. Now, once you have your ground truth dataset, the second layer is running both pipelines on the same query at the same time. For each episode topic, you run the full pipeline twice: once with Tavily, once with Exa, using the same LLM, the same prompt, the same timestamp. That's the variable isolation step. You're changing exactly one thing.
And this is the part that requires a bit of infrastructure investment, right? You can't just swap search tools manually after the fact.
Right, you need the pipeline to be modular enough that you can inject either search tool as a parameter. That's actually not a huge lift if the pipeline is already reasonably well-structured. LangChain and LlamaIndex both support this kind of swappable component architecture natively. Tavily integrates with both out of the box. Exa has its own LangChain integration.
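The swappable-component idea can be sketched with a small interface. This is a generic dependency-injection pattern, not the actual API of either provider; `run_pipeline`, the `snippet` field, and the `generate` callable are all hypothetical stand-ins:

```python
from typing import Protocol

class SearchTool(Protocol):
    """The one component that varies between A/B runs."""
    name: str
    def search(self, query: str) -> list[dict]: ...

def run_pipeline(topic: str, search: SearchTool, generate) -> dict:
    """Run the full episode pipeline with an injected search backend.

    `generate` stands in for the fixed LLM + prompt step, so the
    search backend is the only variable changing between runs.
    """
    docs = search.search(topic)
    context = "\n\n".join(d["snippet"] for d in docs)
    return {
        "search_tool": search.name,
        "retrieved": docs,
        "script": generate(topic, context),
    }
```

Calling `run_pipeline` twice with the same topic and the same `generate`, once per search backend, is the controlled experiment: one variable changes.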
Let's talk about what you're actually measuring. Because "does this episode contain the right facts" is important, but it's not the whole picture.
Not even close. The grading rubric needs to operate on three distinct levels, and this is where RAGAS really shines as a framework. The first level is retrieval quality, which is about the search tool's performance before the LLM even touches the content. The second level is generation quality, which is about what the LLM does with the retrieved content. And the third level is end-to-end quality, which is the combined system evaluated against listener outcomes.
Walk me through retrieval quality first, because I think that's where the Tavily versus Exa comparison lives.
On the retrieval side, the key metrics are context precision, context recall, noise sensitivity, and what I'd call source freshness. Context precision asks: of the documents that were retrieved, how many were actually relevant to the episode topic? Context recall asks the inverse: of all the facts that should have been in the episode, how many were actually present in the retrieved documents? Noise sensitivity measures how much irrelevant content made it through, and whether that irrelevant content confused the LLM. And source freshness measures the timestamp distribution of the retrieved articles.
That last one seems particularly decisive for this use case.
It might be the single most important metric for a breaking news podcast. Exa updates its index every hour. For a war producing major developments every few hours, that hourly freshness is potentially decisive. Tavily, as an aggregator of traditional search engines, is dependent on the freshness of the underlying sources it's pulling from. The benchmark should explicitly log the timestamp of every retrieved document and compute the mean age of retrieved articles at the time of each episode recording.
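The source-freshness metric itself is just timestamp arithmetic. A minimal sketch, assuming you've already logged a publication timestamp per retrieved document:

```python
from datetime import datetime, timezone

def mean_article_age_hours(published_at: list[datetime],
                           recording_time: datetime) -> float:
    """Mean age, in hours, of retrieved articles at recording time."""
    ages = [(recording_time - t).total_seconds() / 3600.0
            for t in published_at]
    return sum(ages) / len(ages)
```

An episode whose retrieved articles are two and twelve hours old gets a mean age of seven hours; comparing that number across Tavily and Exa runs is the freshness benchmark.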
And if Exa is pulling articles from two hours ago while Tavily is pulling articles from twelve hours ago, that gap in coverage could explain a lot of quality differences that currently just read as "this episode felt better."
That's exactly the kind of insight a formal benchmark surfaces. Now on the generation side, the central metric is what RAGAS calls faithfulness. The formula is straightforward: take the number of claims in the script that are supported by the retrieved context, divide by the total number of claims in the script. That ratio is your faithfulness score. A script with a faithfulness score of point nine means ninety percent of its claims trace back to something the search tool actually retrieved. A script with a score of point six is generating forty percent of its claims from somewhere else, which in practice means the LLM is hallucinating or relying on its pre-training knowledge, which for this topic is out of date.
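The faithfulness ratio reduces to a few lines once claim extraction and claim verification have happened. In practice both of those steps are done by an LLM; here the verifier's output is assumed to be precomputed:

```python
def faithfulness_score(claims: list[str], supported: set[str]) -> float:
    """RAGAS-style faithfulness: supported claims / total claims.

    `supported` holds the claims an LLM verifier judged to be
    entailed by the retrieved context; that step is assumed done.
    """
    if not claims:
        return 0.0
    return sum(1 for c in claims if c in supported) / len(claims)
```

Ten claims with nine supported yields 0.9; the remaining claim is the one to inspect for hallucination or stale pre-training knowledge.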
So a high faithfulness score is actually a signal that the search tooling is working, not just that the LLM is smart.
Precisely. If the LLM has good retrieved context to work from, its faithfulness score should be high. If the faithfulness score drops, you need to ask whether the search tool failed to retrieve relevant content, or whether the LLM is going off-piste despite having good context. Those are very different failure modes with very different fixes.
What about hallucination rate? Is that just the inverse of faithfulness?
Essentially yes. The hallucination rate is the percentage of script claims that have no source in the retrieved context. But it's worth tracking separately because hallucinations tend to cluster around specific topics. An LLM might be faithful on diplomatic facts but hallucinate military details, or vice versa. Tracking hallucinations by category lets you identify systematic weaknesses in either the retrieval or the generation step.
There's also a temporal accuracy dimension here that seems tricky. Getting the facts right but getting the sequence wrong could be just as misleading.
And for the Iran war specifically, sequence is everything. Whether Iran rejected the ceasefire before or after the naval blockade was confirmed changes the entire narrative of what's happening. So temporal accuracy, meaning are dates, timelines, and sequences correct, is a dedicated metric in the rubric. You check it against the timestamped ground truth dataset.
Okay, let's talk tools. Because we've been describing what to measure. Now, what software actually helps you do this?
There are four tools worth knowing. RAGAS is the first and most directly applicable. It's an open-source Python library built specifically for RAG evaluation. It implements all the metrics we've described, context precision, context recall, faithfulness, factual correctness, semantic similarity, and it can generate synthetic test datasets from existing documents. It supports LangChain, LlamaIndex, and Haystack integrations out of the box. For a podcast pipeline that's probably already using one of those orchestration frameworks, RAGAS drops in with minimal friction.
And it handles the LLM-as-judge component automatically?
It does. You configure it with your preferred judge model, and it handles the prompting and scoring. It also supports Vectara's HHEM-2.1-Open, a free, open-source hallucination detection model, as an alternative to LLM-based faithfulness scoring. That's useful if you want to reduce the cost of running evaluations at scale.
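Feeding RAGAS is mostly a matter of assembling rows in the columnar shape its evaluators consume. The column names below follow the widely used question/contexts/answer/ground_truth convention, but RAGAS has renamed these across versions, so check the docs for the release you install; the validation helper is a hypothetical convenience, not part of RAGAS:

```python
REQUIRED_COLUMNS = {"question", "contexts", "answer", "ground_truth"}

def build_eval_row(question: str, retrieved_docs: list[dict],
                   script_answer: str, reference: str) -> dict:
    """Assemble one evaluation row from a pipeline run.

    `retrieved_docs` entries are assumed to carry a `snippet` field.
    """
    return {
        "question": question,
        "contexts": [d["snippet"] for d in retrieved_docs],
        "answer": script_answer,
        "ground_truth": reference,
    }

def validate_rows(rows: list[dict]) -> bool:
    """Fail fast if any row is missing an expected column."""
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"row {i} missing columns: {missing}")
    return True
```

With rows like these in hand, converting to a dataset object and calling the library's evaluate function with your chosen metrics is the remaining step.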
What's the second tool?
LangSmith, which is LangChain's full evaluation platform. What LangSmith adds on top of RAGAS is the experiment tracking and pairwise evaluation infrastructure. Every pipeline run is logged with inputs, outputs, intermediate steps, and scores. But the feature most relevant here is pairwise evaluation. Instead of asking "how good is this script on a scale of one to ten," you ask "which of these two scripts better covers the Iran blockade?" The LangSmith documentation makes a point that's directly applicable to this podcast's situation: pairwise evaluation works well when directly scoring an output is difficult but comparing two outputs is straightforward. That is exactly the A/B test scenario we're describing.
That's a meaningful distinction. Rating a single episode is hard because you need an internal reference for what "good" looks like. Comparing two episodes is easier because you have an external reference right there.
And pairwise evaluation has another advantage: it avoids the "what does seven out of ten mean?" problem. Absolute scores are noisy because different raters have different internal scales. Relative preferences are more consistent. Exa's own evaluation research found that pairwise and pointwise rankings are highly correlated, so you can use pointwise scoring for scale and pairwise comparisons for calibration.
Third tool?
Exa's own evaluation methodology, which they published in a public blog post. This is one of the more remarkable pieces of transparency I've seen from a search company. They published their full internal grading prompt, the one they use to evaluate their own search engine, and it scores five dimensions: query relevance from zero to one, result quality from zero to one, content issues as a true or false flag, confidence from zero to one, and an overall score from zero to one. The meta-loop here is delicious. You could literally use Exa's own published grading methodology to evaluate Exa's performance in this podcast pipeline.
That is genuinely funny. "We will use your rubric to grade you."
And Exa would probably be fine with it, given that they published the rubric specifically to be used this way. Their broader evaluation philosophy is what they call open evals, meaning you define a list of queries, run them through each search provider, and grade results with an LLM. They contrast this with closed benchmarks like MS MARCO, which they argue suffer from false negatives: documents that are actually relevant but weren't labeled as such because the labelers never saw them. For a news podcast, this false negative problem is acute. The correct answer to "what happened in Iran today" is a moving target. Any benchmark built on yesterday's facts will have false negatives for today's developments. Open evals with LLM graders are the right architecture for this use case.
Fourth tool?
SimpleQA, which is an OpenAI benchmark that Exa uses to evaluate RAG quality. The approach is to pass factual questions to an LLM with search access and grade whether the LLM correctly answers them. Exa achieved the highest SimpleQA performance among search providers in their published benchmarks, outperforming Bing, Perplexity, and GPT-4o without search. On the FRAMES benchmark, which tests multi-step reasoning over retrieved documents, Exa scored fifty-four point four percent versus Perplexity at forty-four point five percent and Brave at twenty-one point six percent. For the podcast, the directly applicable version is what I'd call a PodcastQA dataset: fifty factual questions about the Iran-Israel War, each with a verified correct answer, a source URL, and a timestamp. You run both pipelines against this dataset and compare their accuracy rates.
Let's get into the subjectivity removal question, because I think that's where a lot of this becomes practically actionable for someone running a podcast pipeline.
The subjectivity problem has several layers. The most obvious is replacing "does this sound good" with structured factual verification. But there are subtler forms of subjectivity that creep back in if you're not careful. The first is grader bias. If you're using an LLM as a judge, that LLM has its own tendencies, its own preferred writing styles, its own knowledge gaps. Exa spent significant time calibrating their grading prompt to match human preferences ninety-seven percent of the time on straightforward examples and eighty-three percent on hard or ambiguous ones. That's close to inter-human agreement, but it required iteration on the prompt. The calibration process matters as much as the tool.
How do you calibrate a grading prompt?
You start by having the host review a set of episode passages manually and record their judgments. Then you run the LLM grader on the same passages and compare. Where they disagree, you examine the grader's reasoning and adjust the prompt. You add few-shot examples showing what a good versus bad passage looks like for this specific format. You iterate until the grader's outputs match the host's careful judgments on a held-out set. Once that calibration is done, you version the prompt and freeze it. Every subsequent evaluation uses the same prompt version so that scores are comparable across episodes.
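The calibration loop is easy to instrument. A minimal sketch, assuming the host's judgments and the grader's outputs have been collected as parallel label lists:

```python
def agreement_rate(host_labels: list[str],
                   grader_labels: list[str]) -> float:
    """Fraction of passages where the LLM grader matches the host."""
    if len(host_labels) != len(grader_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(h == g for h, g in zip(host_labels, grader_labels))
    return matches / len(host_labels)

def disagreements(host_labels: list[str], grader_labels: list[str],
                  passages: list[str]) -> list[str]:
    """The passages to inspect when tuning the grading prompt."""
    return [p for h, g, p in zip(host_labels, grader_labels, passages)
            if h != g]
```

Each prompt revision gets re-scored on the held-out set; once the agreement rate plateaus near the level you'd expect between two careful humans, you version and freeze.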
There's also the problem of the grader knowing the answer already. If you ask GPT-4.1 whether a script correctly described the naval blockade, and GPT-4.1 already knows about the naval blockade from its training data, it might grade the script as correct even if the script hallucinated that fact rather than retrieving it.
This is what Exa calls the verification gap, and it's philosophically uncomfortable. You're using an LLM to grade whether another LLM correctly retrieved information the grader might already know. The solution is to use factual questions about very recent events, specifically post-knowledge-cutoff events, where the grader genuinely doesn't know the answer and must rely on the retrieved context to evaluate correctness. The Iran war is perfect for this because it started after most knowledge cutoffs. A grader evaluating whether the script correctly describes the April thirteenth naval blockade announcement has no prior knowledge to fall back on. It has to reason from the retrieved evidence.
Which means the benchmark gets more reliable the more cutting-edge the topic is. There's something counterintuitive about that.
It flips the usual assumption that recent events are harder to evaluate. For this specific evaluation architecture, recent events are actually easier to evaluate cleanly because you've removed the grader's prior knowledge as a confounding variable.
What about the blind evaluation piece? Because the host is currently running undisclosed A/B tests, which is a good instinct, but it needs formalization.
The blind piece is essential and it's currently informal. Formalizing it means using randomized identifiers when comparing outputs. You label scripts as Episode A and Episode B, not as Tavily episode and Exa episode. The grader, whether human or LLM, never sees which search tool generated which script. This prevents both conscious and unconscious bias toward whichever tool the host currently prefers or expects to perform better.
And for the LLM grader, the label isn't visible anyway, so it's naturally blind. The human spot-check is where the blinding discipline matters.
Right. Exa's methodology recommends that humans manually review about ten percent of graded examples to validate the LLM grader's calibration. When you're doing that manual review, you need to be looking at blinded outputs. Otherwise you're defeating the purpose.
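Randomized blinding is a few lines of code, and worth writing down because it's easy to do sloppily. A sketch, with the function name and return shape as illustrative choices:

```python
import random

def blind_pair(script_tavily: str, script_exa: str,
               rng: random.Random) -> tuple[dict, dict]:
    """Randomly assign the two scripts to labels A and B.

    Returns (blinded, key): the reviewer sees only `blinded`;
    `key` is stored separately and consulted only after grading.
    """
    pair = [("tavily", script_tavily), ("exa", script_exa)]
    rng.shuffle(pair)
    blinded = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return blinded, key
```

The important discipline is operational, not algorithmic: the key lives somewhere the reviewer can't see until every judgment is recorded.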
Let's talk about the multi-rater piece, because I think running multiple LLMs as graders is an underappreciated robustness check.
Exa runs their grading with GPT-4o, GPT-4o mini, GPT-4.1, and Gemini Flash two point five, and they found high agreement across all four on rankings. That cross-model consistency check is valuable because it tells you whether your evaluation is measuring something real or just capturing the idiosyncrasies of one particular grader model. If GPT-4.1 says Exa outperforms Tavily on source freshness and Claude three point five Sonnet agrees, you have much higher confidence in that finding than if only one grader produces that result.
There's also the statistical significance piece, which I feel like gets glossed over in most informal evaluations.
It gets completely ignored in vibe-check evaluations. Exa's methodology is explicit about this: don't conclude that one search tool is better from three episodes. Report confidence intervals around all scores, not just point estimates. Use at least fifty to one hundred queries for reliable comparisons. The Anthropic paper on statistical approaches to model evaluation, which Exa cites in their methodology, recommends reporting clustered standard errors. For the podcast, this means you need to run the benchmark across many episodes before drawing conclusions about Tavily versus Exa. The current A/B test sample size is almost certainly too small to be statistically meaningful.
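One simple way to put an interval around a pairwise win rate, without distributional assumptions, is a percentile bootstrap. This is a generic statistical sketch, not the specific procedure from either Exa's or Anthropic's write-ups:

```python
import random

def bootstrap_ci(wins: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, rng=None):
    """Percentile bootstrap CI for a pairwise win rate.

    `wins` is 1 where, say, Exa beat Tavily on an episode, else 0.
    """
    rng = rng or random.Random(0)
    n = len(wins)
    means = sorted(
        sum(rng.choice(wins) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval for an 18-out-of-30 win record still straddles 0.5, the honest conclusion is "not enough episodes yet," which is exactly the discipline the vibe check lacks.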
Let's talk about what a fully realized version of this benchmark looks like, because I want to give people a concrete picture of the end state.
Here's the architecture I'd call PodcastRAG-Eval. You start with a dataset of fifty factual questions about the Iran-Israel War, covering events from March through April of this year. Each question has a verified correct answer with a source URL and a timestamp. Questions span military events, diplomatic developments, economic impacts, and humanitarian consequences, and they're categorized by difficulty: easy for widely reported facts, medium for questions requiring synthesis across sources, hard for very recent or niche developments.
And the pipeline variants you'd test against that dataset?
At minimum, four combinations: Tavily with GPT-4o, Tavily with Claude, Exa with GPT-4o, Exa with Claude. Plus a no-search baseline, the LLM running on its own without any retrieval, to establish how much the search tooling is actually contributing. That baseline is important because it tells you whether you're measuring search quality or just LLM quality.
The metrics you'd report across all of those?
Seven core metrics. Factual accuracy rate, meaning the percentage of ground truth questions correctly answered in the script. Faithfulness score using the RAGAS formula of supported claims divided by total claims. Context precision, meaning the percentage of retrieved documents that were actually relevant. Source freshness, meaning the mean age of retrieved articles at the time of recording. Hallucination rate, meaning the percentage of script claims with no source in the retrieved context. Completeness, meaning the percentage of ground truth questions covered at all, even if answered incorrectly. And pairwise win rate, meaning the head-to-head comparison of Tavily versus Exa scripts on the same topic.
That completeness metric is interesting. You could have a script that gets everything it covers right but misses half the important developments entirely.
And that's a different failure mode from hallucination. A script with high faithfulness and low completeness is being careful but narrow. A script with high completeness and low faithfulness is being comprehensive but sloppy. You want both high. The tension between them is actually a useful signal about how the LLM is balancing breadth and depth given the retrieved context.
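The completeness-versus-accuracy distinction is worth encoding explicitly so the two failure modes can't blur together. A sketch, assuming questions and answers have been normalized to exact-match strings (real grading would use an LLM or fuzzy matcher for the correctness check):

```python
def coverage_metrics(ground_truth: dict[str, str],
                     script_answers: dict[str, str]) -> dict[str, float]:
    """Completeness counts questions the script addressed at all;
    factual accuracy counts only those it answered correctly."""
    total = len(ground_truth)
    covered = [q for q in ground_truth if q in script_answers]
    correct = [q for q in covered
               if script_answers[q] == ground_truth[q]]
    return {
        "completeness": len(covered) / total,
        "factual_accuracy": len(correct) / total,
    }
```

A careful-but-narrow script scores high on the gap between the two numbers being small; a comprehensive-but-sloppy one shows high completeness with accuracy lagging well behind it.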
There's a practical takeaways question here that I want to get to, because some of this is theoretically elegant but the implementation path matters. What does someone actually do first?
The first step is the one that costs nothing: build the ground truth dataset. Write fifty factual questions about the Iran-Israel War with verified answers, source URLs, and timestamps. That dataset is the foundation everything else builds on. You can start running informal evaluations against it immediately, even before you've set up RAGAS or LangSmith, just by manually checking whether scripts contain the right answers.
And the second step?
Make the pipeline modular. Ensure that the search tool is a swappable parameter so you can run the same topic through Tavily and Exa in the same session. This is the variable isolation step that turns your current informal A/B test into a controlled experiment. If the pipeline is already using LangChain or LlamaIndex, both Tavily and Exa have native integrations, so the engineering lift is relatively small.
Third step?
Add logging. Every pipeline run should log the LLM used, the search tool used, the episode date, the topic category, the number of search queries issued, the latency, and the full list of retrieved documents with their timestamps. The podcast already announces the LLM used in every episode, which is excellent metadata. Extending that to include the search tool and the retrieved source timestamps gives you the provenance data you need for meaningful analysis.
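A JSON-lines log with exactly those fields is enough infrastructure to start. The schema below is an illustrative suggestion, not a standard; `published_at` assumes the search tool returns (or you can scrape) a publication timestamp per document:

```python
import json
from datetime import datetime, timezone

def make_run_log(llm: str, search_tool: str, topic: str,
                 retrieved: list[dict], latency_s: float) -> str:
    """Serialize one pipeline run's provenance as a JSON line.

    `retrieved` entries are assumed to carry `url` and
    `published_at` fields from the search step.
    """
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "llm": llm,
        "search_tool": search_tool,
        "topic": topic,
        "latency_s": latency_s,
        "retrieved": [
            {"url": d["url"], "published_at": d["published_at"]}
            for d in retrieved
        ],
    }
    return json.dumps(record)
```

Appending one such line per run gives you the raw material for every freshness, provenance, and retrospective analysis discussed above.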
The podcast announces the LLM, which is actually a really interesting commitment. Because it means listeners can track quality changes over time even informally.
And it's the kind of transparency that makes formal evaluation possible in retrospect. Exa's evaluation philosophy makes a point I find compelling: there is still no substitute for manually running a few queries yourself. Evals, especially LLM-graded ones, provide a narrow and sometimes biased window. The benchmark doesn't replace the host's judgment. It supplements it with structure, reproducibility, and statistical power. The host's ear is still the north star calibration for what good sounds like. The benchmark is the instrument that lets you measure systematically toward that north star.
What about the longer-term possibility of using the podcast's existing archive as a benchmark dataset? Because there are over two thousand episodes at this point. That's a lot of data.
That's actually a fascinating angle. Every episode ever produced is a data point. You could retroactively build a benchmark by identifying all the factual claims made in past episodes, fact-checking them against contemporaneous news sources, and computing a historical accuracy rate per LLM and per search tool. That would be the most authentic possible evaluation because you'd be using the podcast's own output as the test set. The challenge is that it requires significant manual effort to build the ground truth for historical episodes, and the topics vary enough that you'd need to be careful about comparability. But for the Iran war specifically, where you have a coherent topic across many recent episodes with a clear factual record, it's very doable.
And the metadata is already there. You know the LLM used for each episode. You'd just need to add the search tool used, if that wasn't already logged.
Which is why starting to log that now, even before you have a full eval framework in place, is worth doing. The data you capture today becomes the historical benchmark you can run retrospectively in six months. The infrastructure investment compounds.
There's one more angle I want to hit, which is the downstream task philosophy that Exa articulates. Because I think it reframes the whole question.
This is their most interesting evaluation idea. The argument is that the ultimate test of a search engine is not whether it returned relevant documents, but whether it helped the LLM accomplish the downstream task. For this podcast, the downstream task is: did a listener learn accurate, current information about the Iran-Israel War? And that is, at least in principle, directly measurable. You could survey listeners before and after an episode and test their factual knowledge. You could track which claims from the episode appear in listener social media posts and check whether those claims are accurate. These are harder to implement than RAGAS scores, but they're the north star metrics that all the intermediate metrics are proxies for.
The gap between "the retrieval was technically good" and "the listener came away with accurate beliefs" is where a lot of value gets lost or created.
And it's a gap that's easy to ignore when you're focused on the pipeline mechanics. A script can have high faithfulness, meaning all its claims trace back to retrieved sources, and still mislead listeners if the retrieved sources were themselves misleading or if the script's framing emphasized the wrong things. The downstream task evaluation catches failures that no upstream metric would flag.
Alright, let's land this. What's the summary of what someone should actually do?
Build the ground truth dataset first, fifty factual questions with verified answers and timestamps. Make the pipeline modular so you can run controlled A/B tests with one variable changing at a time. Add logging for search tool, retrieved sources, and timestamps. Implement RAGAS for the core metrics, faithfulness, context precision, context recall, and factual correctness. Use LangSmith's pairwise evaluation infrastructure to compare Tavily and Exa outputs on the same topics. Calibrate an LLM grader against the host's own judgments on a small set of examples, then version and freeze the grading prompt. Run multiple LLM graders to cross-check consistency. Report confidence intervals, not just point estimates, and don't draw conclusions from fewer than ten to twenty episodes per comparison. And keep the host's manual review in the loop for ten percent of graded examples to catch grader drift.
The meta-point being that the vibe check doesn't go away. It gets supplemented.
The vibe check is your calibration signal. The benchmark is your measurement instrument. You need both.
This has been a genuinely satisfying one to work through. The specificity of the use case, a news podcast covering a live war, makes all the abstract evaluation theory land concretely.
And it's a problem that generalizes. Any RAG pipeline covering fast-moving topics faces the same evaluation challenges. The Iran war just makes the stakes visible in a way that clarifies why the formalization matters.
Thanks as always to our producer Hilbert Flumingtop for keeping the whole operation running. Big thanks to Modal for providing the GPU credits that power this show. If you want to get notified when new episodes drop, search for My Weird Prompts on Telegram. This has been My Weird Prompts. We'll see you next time.
Take care.