So Daniel sent us this one, and I want to read it out because it frames the problem really precisely. He writes: "How do you know an AI wargame or policy simulation is good? The brutal evaluation problem. You cannot compare to ground truth because the event hasn't happened — and if it has, the simulation is post-hoc. So what do you actually measure? Walk through the candidate methodologies and where each one breaks: backtesting on historical crises, inter-run consistency, expert red-teaming, predictive calibration on proxy events, and process validity. The brutal honesty: most published LLM simulation projects skip evaluation entirely, or do a token 'we asked some experts' pass. This is the field's biggest credibility problem." So. Yeah. That last line is doing a lot of heavy lifting.
It really is. And the timing here is sharp — DARPA released their Game Changer simulation platform in January, which is basically an institutional signal that these tools are moving out of research labs and into actual policy planning rooms. So the evaluation question stops being academic really fast.
Right, because there's a meaningful difference between a researcher at Stanford running a thousand simulations for a paper and a defense analyst feeding outputs into a briefing that informs real decisions. The stakes of "is this thing actually measuring anything" go up considerably.
And that's exactly the framing that makes this problem brutal. With most AI systems, you have ground truth. You test a medical diagnostic model against confirmed diagnoses. You test a translation model against verified translations. The feedback loop is tight. With a geopolitical simulation, you're asking: will this crisis escalate? And you won't know the answer for months, years, or ever — because the whole point is to inform decisions that might prevent the scenario from playing out.
Oh, and by the way — today's script is being generated by Claude Sonnet 4.6, so the AI writing our words about AI evaluation is a nice layer of recursion to sit with.
Meta in the best possible way. Okay, so let's actually walk through the five methodologies Daniel laid out, because each one fails in a different and instructive direction. The failures aren't random — they each reveal something structural about the evaluation problem.
Start with backtesting, because on the surface it sounds completely reasonable. You have historical crises with known outcomes. Run the simulation on the Cuban Missile Crisis, the nineteen seventy-three Yom Kippur War, the twenty fourteen Crimea annexation. If the model gets close to what actually happened, that's validation, right?
That's the intuition, and it's wrong in a way that's almost embarrassing once you see it. The problem is temporal contamination. These LLMs were trained on the entire internet, which includes fifty years of post-hoc analysis of the Cuban Missile Crisis. Every memoir, every declassified document, every political science paper dissecting Khrushchev's decision calculus — it's all in the training data.
So when you ask the model to simulate the Cuban Missile Crisis, it's not actually simulating anything. It's pattern-matching to the answer it already knows.
There's a GPT-4 example that illustrates this perfectly. When researchers ran it on the Cuban Missile Crisis scenario, the model reproduced almost textbook-accurate analysis — the thirteen days framing, Kennedy's EXCOMM deliberations, the Khrushchev backchannel. It looked like brilliant strategic reasoning. But what it was actually doing was compressing and reciting analysis it had ingested during training. There was no genuine inference happening, no counterfactual reasoning. It was retrieval dressed up as simulation.
Which is genuinely hard to detect from the outside, because the output looks exactly like what you'd want a good simulation to produce.
That's the insidious part. High face validity, zero construct validity. The distinction matters enormously. Face validity means the output looks plausible and professional. Construct validity means the output is actually measuring what you claim to be measuring — in this case, strategic reasoning under uncertainty. Most current LLM wargames have the first and can't demonstrate the second.
So what's the fix? Because researchers must have noticed this problem.
There's a promising direction in an ICLR twenty twenty-six submission proposing what they call rapid backtesting without temporal contamination. The idea is to use very recent, relatively niche events that occurred after the model's training cutoff. Events the model genuinely hasn't seen analyzed to death. You're testing actual forecasting capability rather than recall.
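The contamination-screening step described here is simple enough to sketch. This is a minimal illustration, not the ICLR submission's actual pipeline — the cutoff date, event records, and field names are all invented:

```python
from datetime import date

# Hypothetical training cutoff for the model under test.
TRAINING_CUTOFF = date(2024, 6, 1)

def clean_backtest_set(events):
    """Keep only events that strictly postdate the training cutoff.

    The model cannot have seen post-hoc analysis of these events, so
    getting them right measures forecasting rather than recall.
    """
    return [e for e in events if e["occurred"] > TRAINING_CUTOFF]

candidates = [
    {"name": "Cuban Missile Crisis", "occurred": date(1962, 10, 16)},
    {"name": "regional border incident", "occurred": date(2024, 9, 3)},
]
usable = clean_backtest_set(candidates)
```

The hard part in practice is everything this sketch assumes away: knowing the true cutoff, and verifying the event wasn't discussed in training data that postdates the event itself.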
That's clever. Though it creates its own problem — you're limited to testing on low-profile recent events, which may not have the strategic complexity you actually care about.
Right, you're trading contamination for representativeness. The events you can cleanly test on are probably not the events that matter most for policy. So backtesting as a methodology survives in modified form, but it can't carry the validation burden alone. Which brings us to the second methodology: inter-run consistency.
This one feels like it's solving a different problem than the one we care about.
It is, and being clear about that distinction is important. The method is straightforward: run the same scenario a thousand times, vary the temperature settings, and see how the outcomes cluster. If ninety percent of runs end in a negotiated settlement, that tells you the simulation is stable — the outputs aren't chaotic.
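The clustering check described here is mechanically straightforward. A minimal sketch, with invented outcome labels and an arbitrary entropy threshold — real projects would need a principled way to categorize end-states:

```python
import math
from collections import Counter

def consistency_report(outcomes):
    """Summarize how tightly simulation outcomes cluster across runs.

    `outcomes` is a list of categorical end-states, one per run
    (e.g. "negotiated_settlement", "escalation", "stalemate").
    """
    counts = Counter(outcomes)
    n = len(outcomes)
    dist = {label: c / n for label, c in counts.items()}
    # Normalized Shannon entropy: 0.0 = perfectly consistent,
    # 1.0 = uniformly spread across the observed labels.
    if len(counts) > 1:
        entropy = -sum(p * math.log(p) for p in dist.values())
        entropy /= math.log(len(counts))
    else:
        entropy = 0.0
    return {"distribution": dist, "normalized_entropy": entropy}

# 900 of 1000 hypothetical runs end in settlement: low entropy, stable.
# Note this says nothing about whether "settlement" is the right answer.
runs = ["negotiated_settlement"] * 900 + ["escalation"] * 100
report = consistency_report(runs)
```

The comment in the last block is the whole point of the next exchange: this metric can score a consistently wrong simulation as excellent.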
But stable and correct are completely different things.
Completely. Consistency measures precision, not accuracy. You can have a simulation that produces the same wrong answer a thousand times in a row, and by the consistency metric, it looks great. It's what researchers call the echo chamber problem — the model is stuck in a local minimum of its own reasoning patterns.
What does that actually look like in practice? What's a systemic bias that would cause consistent wrongness?
The hawk bias is the clearest example. There's solid evidence from Rivera and colleagues' twenty twenty-four paper — the seminal work on escalation risks from LLMs in military decision-making — that GPT-4 Base, without alignment fine-tuning, was shockingly warlike. The model would opt for escalatory actions, including in some scenarios nuclear options, with justifications that read almost like they were drawn from action thriller novels. "We have it, let's use it" was an actual output. And it was consistent. Run after run, the model escalated. High consistency, catastrophically wrong.
That's a remarkable finding. And it points to something about the training data, not just the model architecture.
The training data is full of dramatic narratives — fiction, news coverage that emphasizes conflict, historical accounts that foreground escalation because those are the stories that get written about. The base model absorbs those patterns. Without heavy alignment work, it defaults to the most dramatically coherent sequence of events, which in military scenarios tends toward escalation.
So consistency checks are useful for catching chaos — for verifying the simulation is at least deterministic enough to be interpretable — but they tell you nothing about whether the model's strategic preferences are calibrated to reality.
That's the right framing. Consistency is a necessary condition, not a sufficient one. You need it, but passing that bar doesn't mean anything beyond "this simulation isn't random noise." Now, the third methodology is where most of the field currently lives, because it's the most natural thing to reach for: expert red-teaming.
Have some smart people read the outputs and tell you if they seem plausible.
Which sounds reasonable, and in the absence of better options, it is the gold standard. But it has a fundamental vulnerability that a twenty twenty-four study exposed in painful detail. The study compared two hundred fourteen national security experts to LLM outputs on wargame scenarios. The finding was that while LLMs sounded like experts — confident, precise, appropriately hedged — their underlying strategic preferences were fundamentally different from what actual experts would choose.
Different in what direction?
LLMs were far more sensitive to the framing of command instructions. Change the wording of the prompt slightly, and the strategic preference shifted dramatically. Human experts, by contrast, showed much more robust preferences — they were responding to the substance of the scenario, not the surface-level framing. And LLMs were far less sensitive to the fog of war. Experts modeled uncertainty, second-guessed information, factored in what they didn't know. LLMs tended to treat the scenario information as complete and authoritative.
So the LLM sounds like an expert but reasons like a very different kind of system.
And the problem for expert review is what gets called the eloquence trap. Experts reading the transcripts are often swayed by the confident, professional tone of the output. The LLM writes like a senior analyst, uses the right vocabulary, structures its reasoning in familiar ways. That surface fluency biases the reviewer toward positive assessment even when the underlying reasoning is off.
Which is a problem that gets worse as models get better at sounding authoritative.
It scales in the wrong direction. The more fluent the model, the harder it is for expert reviewers to separate "this sounds right" from "this is right." And there's the additional issue that expert review is expensive, slow, and hard to standardize. The Brookings Institution ran a simulation of Iran-Israel escalation scenarios in twenty twenty-five, and the entire validation framework consisted of three experts rating transcripts on a subjective plausibility scale. Three people. For a simulation that was presumably informing policy-adjacent analysis.
Three people is not a methodology. Three people is a vibe check.
It's a credentialed vibe check, which is the field's current gold standard. That should tell you something about where the baseline is. And this is the brutal honesty that Daniel flagged in his prompt — most published LLM simulation work either skips evaluation entirely, or does exactly this: gestures at expert opinion without any systematic framework for what the experts are evaluating, what criteria they're applying, or how their individual biases might be affecting their ratings.
Let's talk about the fourth methodology, because this one feels like it's trying to do something genuinely clever by sidestepping the ground truth problem.
Predictive calibration on proxy events. The logic is: if you can't validate the big scenario because it hasn't happened, validate the model's forecasting ability on smaller events that will happen. Short-horizon predictions — will there be a military skirmish in the South China Sea this week, will a particular sanctions package pass, will a ceasefire hold for more than seventy-two hours. Score those predictions against reality, build a calibration curve, and use that as evidence the model has genuine strategic reasoning capability.
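The scoring machinery here is standard forecasting evaluation. A minimal sketch using the Brier score and simple probability binning — the forecast history is invented, and real calibration analysis would need far more data points per bin:

```python
def brier_score(forecasts):
    """Mean squared error between predicted probabilities and outcomes.

    `forecasts` is a list of (predicted_probability, outcome) pairs,
    where outcome is 1 if the event occurred and 0 otherwise.
    Lower is better; always predicting 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

def calibration_bins(forecasts, n_bins=10):
    """Group forecasts by predicted probability and compare the mean
    prediction in each bin to the observed event frequency."""
    bins = {}
    for p, o in forecasts:
        b = min(int(p * n_bins), n_bins - 1)
        bins.setdefault(b, []).append((p, o))
    return {
        b: {
            "mean_prediction": sum(p for p, _ in items) / len(items),
            "observed_frequency": sum(o for _, o in items) / len(items),
            "count": len(items),
        }
        for b, items in sorted(bins.items())
    }

# Hypothetical short-horizon proxy forecasts: (predicted prob, outcome).
history = [(0.9, 1), (0.8, 1), (0.7, 0), (0.3, 0), (0.2, 0), (0.1, 0)]
score = brier_score(history)
```

A well-calibrated model has bins where `mean_prediction` tracks `observed_frequency` — which, as the next exchange argues, still only validates the tactical, scoreable end of the difficulty spectrum.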
That's the superforecasting approach applied to model evaluation.
Right, and it works reasonably well for the narrow problem it addresses. The limitation is what you might call scale invariance failure. Being a well-calibrated forecaster for tactical, short-horizon, data-rich events does not mean you have the kind of strategic wisdom required for long-horizon, low-data, high-stakes scenarios. These are genuinely different cognitive tasks.
It's the difference between predicting whether it'll rain tomorrow and predicting the long-term trajectory of a geopolitical relationship.
The mechanisms are different, the relevant variables are different, the uncertainty structures are different. A model that's excellent at the first might be completely unequipped for the second. And the events you can score in real time — the tactical, near-term ones — are systematically different from the events that actually matter for policy planning. You're validating on the easy cases and extrapolating to the hard ones, which is a shaky inferential leap.
Though I'd argue it's still better than nothing. If a model can't even get the short-horizon calibration right, that's a strong signal something is broken.
As a floor test, absolutely. If you fail the proxy forecasting check, you definitely have a problem. If you pass it, you've cleared a necessary hurdle but not a sufficient one. The methodologies stack — each one eliminates some failure modes, none of them alone is adequate.
Which brings us to the fifth one, process validity, which I think is the most philosophically interesting.
And probably the most underutilized. The insight is: if you can't validate the outcome, validate the reasoning process. Even if you can never know whether the simulation predicted the right geopolitical outcome, you can ask whether the intermediate steps — the chain of reasoning the model followed — are structurally sound.
The "show your work" approach to evaluation.
There's a framework borrowed from military decision theory called the OODA loop — Observe, Orient, Decide, Act. Originally developed by fighter pilot and military strategist John Boyd. The idea for process validation is: does the model actually execute these steps in a principled way? Does it correctly identify the relevant variables in the Observe phase? Does it accurately model the adversary's intent and constraints in the Orient phase? Does the decision it reaches follow coherently from those inputs?
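A purely structural version of that audit can be sketched in a few lines. Everything here is hypothetical — in particular, extracting labeled phases from a free-text transcript is the genuinely hard step, and it's assumed away:

```python
# Canonical phase ordering from Boyd's OODA loop.
OODA_PHASES = ["observe", "orient", "decide", "act"]

def audit_ooda(steps):
    """`steps` is an ordered list of (phase_label, text) pairs extracted
    from a simulation transcript. Returns which phases are missing and
    whether the phases that do appear are out of canonical order.

    A clean audit is necessary, not sufficient: a confabulated rationale
    can still pass a purely structural check.
    """
    seen = [phase for phase, _ in steps if phase in OODA_PHASES]
    missing = [p for p in OODA_PHASES if p not in seen]
    # First occurrence of each phase, checked against canonical order.
    first = []
    for p in seen:
        if p not in first:
            first.append(p)
    expected = [p for p in OODA_PHASES if p in first]
    return {"missing": missing, "out_of_order": first != expected}

transcript = [
    ("observe", "Satellite imagery shows naval movement near the strait."),
    ("decide", "Recommend immediate blockade."),  # decides before orienting
    ("orient", "Adversary intent assessed as deterrence signaling."),
]
result = audit_ooda(transcript)
```

The caveat in the docstring is exactly where the methodology breaks down, as the post-hoc rationalization discussion below makes clear.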
And if the process is sound, you have at least some grounds for trusting the output even without ground truth validation.
That's the argument. And it's genuinely useful because it's separable from outcome validation. You can evaluate process validity right now, with current scenarios, without waiting for history to unfold. The problem — and this is where it breaks down — is what you might call post-hoc rationalization.
The model decides first and explains second.
Which is the core failure mode. LLMs generate tokens sequentially. The "reasoning" they produce is generated in the same pass as the conclusion. There's substantial evidence that when an LLM produces an explanation for a decision, the explanation is often a confabulation — a plausible-sounding narrative constructed after the fact — rather than the actual causal chain that produced the output. The underlying decision was driven by token probability distributions, not by the logical structure of the argument.
So the OODA loop output looks like principled reasoning but might be a post-hoc story the model told itself about a decision it already made.
And distinguishing genuine reasoning from sophisticated confabulation is an open research problem. There are interpretability approaches — mechanistic interpretability work from groups like Anthropic — that try to trace the actual computational path through the model. But applying that to complex multi-step wargame scenarios is not yet tractable at scale. So process validity is theoretically the right thing to measure, and practically very hard to measure rigorously.
Okay, so we've gone through all five. Backtesting fails on contamination, consistency fails on accuracy, expert review fails on subjectivity and the eloquence trap, proxy forecasting fails on scale invariance, process validity fails on confabulation. What does that actually mean for the field?
It means the field has a credibility problem it hasn't fully reckoned with. The gap between what these simulations claim to offer and what can actually be demonstrated is substantial. And the concerning part isn't just that the evaluation methods are imperfect — all evaluation methods are imperfect. The concerning part is that most published work doesn't even engage seriously with the problem. They don't say "we used these methods, here are their limitations, here's why we still think the results are informative." They just skip it, or they do a minimal expert pass and present that as validation.
Why does the field let that slide? Is it just that evaluation is hard and everyone knows it, so there's an implicit agreement not to scrutinize too hard?
Partly that. There's also a structural incentive issue. The people building these simulations are often making claims about their utility for policy. If rigorous evaluation would undermine those claims, there's pressure — not always conscious — to use evaluation frameworks that confirm rather than challenge. And there's a genuine technical difficulty that provides cover. It's easy to say "evaluation is hard in this domain" and then not do it, because that statement is true.
The field is essentially operating on a form of deferred validation. The argument is: we can't prove it now, but trust the framework.
And that argument might be acceptable for a research prototype. It is not acceptable when these tools are moving into policy planning rooms. DARPA's Game Changer platform release in January marks a transition point. At that level of institutional adoption, "we think this is probably useful and some experts found it plausible" is not a sufficient evidentiary standard.
There's also a de-skilling argument worth raising here, because I think it cuts against the whole enterprise in a way that doesn't get enough attention.
The traditional wargame — the kind that military academies and national security institutions have run for decades — is not primarily a forecasting tool. It's a training tool. The point is to put human decision-makers under stress, force them to confront uncertainty, make them experience the friction of incomplete information and time pressure. The simulation is valuable precisely because it's hard and uncomfortable.
And an AI simulation that runs ten thousand scenarios in an afternoon is optimizing for a completely different thing.
It removes the human friction entirely. If the simulation is too fast, too fluent, too authoritative-sounding, it actually undermines the training purpose. You're not learning to handle the fog of war; you're reading a very confident report about how the fog of war was handled. Those are not the same cognitive exercise.
So even if the evaluation problem were solved — even if we could prove the outcomes were well-calibrated — there's a separate question about whether the format is fit for purpose.
Which is maybe the deepest critique. The evaluation debate assumes we know what we want these simulations to do, and we're just trying to verify they do it. But the underlying question of what role AI-generated wargame outputs should play in professional military education and policy analysis hasn't been settled.
Alright, let's make this actionable, because I think there are some concrete things people can actually do with this framework, whether you're building these systems or just encountering their outputs.
The first thing — and this is the one I'd put at the top — is to demand transparency on evaluation methodology before you trust any results. When you see a simulation paper or a policy briefing that cites simulation outputs, the first thing you should check is the evaluation section. What methods did they use? Which ones did they skip? If there's no evaluation section, or it's a paragraph that says "domain experts reviewed the outputs," treat the entire document with significant skepticism.
The absence of rigorous evaluation isn't neutral. It's information. It tells you the authors either couldn't do it or chose not to.
The second thing is that no single methodology is sufficient, and the best current practice is to layer multiple approaches. Use backtesting on post-cutoff events for calibration. Use inter-run consistency as a stability floor. Use expert review with structured rubrics and multiple reviewers, not three people doing an informal pass. Use proxy forecasting to build a calibration curve. Check process validity with explicit OODA loop audits. None of these alone is adequate. Together, they triangulate something more defensible.
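The layering idea can be made concrete as a set of necessary-condition gates. This is a sketch of the structure, not a standard — every metric name and threshold below is a placeholder, since the field has no agreed-on values:

```python
# Each methodology acts as a filter; a project earns provisional trust
# only by passing all of them. Thresholds are illustrative placeholders.
CHECKS = {
    "post_cutoff_backtest_accuracy": (lambda v: v >= 0.6, "forecasting floor"),
    "inter_run_entropy": (lambda v: v <= 0.5, "stability floor"),
    "rubric_reviewer_agreement": (lambda v: v >= 0.7, "structured expert review"),
    "proxy_brier_score": (lambda v: v <= 0.2, "calibration on scoreable events"),
    "ooda_audit_pass_rate": (lambda v: v >= 0.9, "process validity"),
}

def evaluate(metrics):
    """Return the checks a simulation project fails (missing metrics fail).

    Passing everything means 'no known failure mode detected' —
    never 'validated'.
    """
    failures = []
    for name, (passes, label) in CHECKS.items():
        if name not in metrics or not passes(metrics[name]):
            failures.append((name, label))
    return failures

report = evaluate({
    "post_cutoff_backtest_accuracy": 0.65,
    "inter_run_entropy": 0.3,
    "rubric_reviewer_agreement": 0.55,  # informal three-reviewer pass
    "proxy_brier_score": 0.12,
    "ooda_audit_pass_rate": 0.95,
})
```

The design choice worth noting: failures are conjunctive. There's no weighted average where a great Brier score buys back a failed expert-review gate, because each check targets a different failure mode.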
It's the same logic as any other complex measurement problem. You don't trust a single instrument; you cross-validate.
And the third thing, which is maybe the most practically useful for someone who's not building simulations but consuming outputs: when you encounter a simulation result, ask specifically what the evaluation framework was, and what failure modes it didn't address. Every methodology we've discussed has known failure modes. A responsible simulation project should be able to tell you which failure modes they addressed and which ones remain open. If they can't answer that question, the results are not ready for policy use.
I want to come back to something you said about the Rivera et al. escalation risks paper, because I think it points to a broader issue about what these models are actually optimized for.
The Rivera et al. paper is important here. The finding about GPT-4 Base defaulting to warlike behavior — including nuclear options — with almost pulp fiction justifications is a signal about what the base model learned from its training distribution. The internet is full of military fiction, war journalism, thriller novels, dramatic historical accounts. The model absorbed those narrative patterns. Without alignment work specifically targeting that bias, it defaults to the most dramatically coherent story, which in military scenarios is usually escalation.
So the model has a genre bias. It's been trained on stories where things escalate because those are the stories that get written.
And that bias is almost certainly present to some degree in every current LLM, regardless of alignment. The alignment work suppresses it, but the underlying pattern is there. Which means that when you run a wargame simulation and the model produces a confident, detailed escalation scenario, you genuinely cannot tell from the output alone whether that reflects sound strategic reasoning or the model defaulting to its training distribution's most common narrative arc.
That's a genuinely uncomfortable place to be if you're using these outputs to inform real decisions.
It is. And it's why the evaluation problem isn't just a technical nicety — it's the difference between a tool that's useful and a tool that's producing sophisticated-sounding noise. The open question for the field is whether we'll see a reproducibility crisis, analogous to what happened in psychology and medicine, where a systematic attempt to replicate key findings reveals that most of them don't hold up. That reckoning probably needs to happen, and it would be healthier if the field initiated it rather than having it forced by a high-profile failure.
What would a credible institutional response look like? Is this a place for something like a NIST-style standard?
Some kind of standardized evaluation framework, ideally with regulatory or institutional teeth, is probably necessary. NIST has done this for other AI domains — their AI Risk Management Framework is an example of how you can create structured evaluation standards without mandating specific technical approaches. Something analogous for AI wargaming and policy simulation would at minimum require disclosure of evaluation methodology, use of multiple validation approaches, and honest reporting of known limitations.
The reproducibility crisis framing is interesting because in medicine, the crisis was actually productive. It produced better methodology, pre-registration of studies, replication requirements. The field came out stronger.
That's the optimistic read. The pessimistic read is that in national security applications, the institutional pressure to appear capable may be stronger than the pressure to be rigorous. There are real incentives — funding, prestige, policy influence — that reward systems that look like they work. The evaluation problem is partly a technical problem and partly a political economy problem.
And the stakes of getting it wrong are somewhat higher than in psychology.
Somewhat, yes. The phrase "Pentagon AI simulation predicts successful naval blockade" appearing in a policy brief carries a weight that a psychology replication failure doesn't. The downstream consequences of misplaced confidence in a badly calibrated simulation are in a different category.
Alright. What are you actually watching for as this field develops?
The thing I'm most interested in is whether the interpretability work catches up fast enough to make process validity a real methodology rather than a theoretical one. If you can actually trace the computational path through the model and verify that the reasoning is genuine rather than confabulated, that changes the evaluation landscape significantly. Right now process validity is the most theoretically sound approach and the least practically tractable. If that gap closes, it becomes the anchor of a real evaluation framework.
And the ICLR twenty twenty-six work on contamination-free backtesting is worth watching too, because if that methodology matures, you at least have a clean calibration signal for forecasting ability.
The field has the pieces. What it doesn't yet have is the institutional will to put them together rigorously and apply them consistently. That's ultimately what needs to change before these simulations can credibly inform the decisions they're being marketed for.
The gap between "this looks impressive" and "this is validated" is the whole ballgame. And right now most of the field is living in the first category and claiming it's the second.
Which is exactly Daniel's point. The brutal honesty is that the credibility of AI-driven policy simulation currently rests on a foundation that most practitioners know is inadequate and most papers don't acknowledge. That's a problem that gets more urgent the closer these tools get to actual decision-making.
Good prompt, Daniel. This is one of those topics where the more you dig, the more uncomfortable it gets, and I think that's the right reaction. Thanks as always to our producer Hilbert Flumingtop for keeping this show running, and big thanks to Modal for the GPU credits that power the whole pipeline — we genuinely couldn't do this without them. This has been My Weird Prompts. If you want to follow us on Spotify, we're there — search My Weird Prompts and hit follow so you don't miss an episode. Take care.
See you next time.