Imagine you run a geopolitical crisis simulation one hundred times. A Taiwan Strait standoff, say. You get one hundred different outcomes. Some cluster around tense-but-managed de-escalation. A handful spiral into kinetic exchanges. A few outliers do something genuinely unexpected. Now here's the question that should keep any serious analyst up at night: is that spread signal or noise? Is it telling you something real about the scenario, or is it just an artifact of how a language model picks its next word?
That tension is exactly what makes Monte Carlo methodology in LLM wargaming such a fascinating and underexplored problem. And it's not just academic. As these simulations move from hobbyist side projects into actual policy workflows, with think tanks using them to brief decision-makers, the methodology question becomes genuinely high stakes. Get it wrong and you're presenting sampling noise as geopolitical insight.
So Daniel sent us this question, and it's a meaty one. He's asking about stochasticity, seed control, and Monte Carlo methodology in LLM-based wargaming. Specifically: when should a simulation be deterministic versus stochastic, how do the temperature and sampling knobs actually affect actor behavior, what are the minimum viable run counts for different levels of rigor, how do you report results without overclaiming, and, the one that I think is the most philosophically interesting, what is the actual relationship between LLM variance and real-world uncertainty? Because those are not the same thing, and a lot of people seem to be treating them as if they are.
They are very much not the same thing, and we'll get into that. Quick note before we dive in: today's script is being generated by Claude Sonnet 4.6, which feels appropriately meta given we're talking about what LLMs actually do under the hood.
Layers on layers. Alright, let's start with the foundational split. Deterministic versus stochastic runs. Because I think a lot of people coming to this from a traditional modeling background assume you'd always want reproducibility, and the answer turns out to be more nuanced than that.
The intuition for determinism is sound. If you fix a seed and set temperature to zero, you theoretically get the same output every time for the same input. That's enormously valuable for a specific set of tasks. Debugging your prompt logic, running what are essentially unit tests on agent behavior, or producing a single canonical narrative for a briefing document. If your simulation of a North Korean actor does something unexpected, you need to be able to reproduce it to figure out why. You can't debug a probabilistic ghost.
Right, it's the scientific reproducibility argument. If your result can't be reproduced, it's not a result.
But here's where it gets interesting, and this is something Stumpf documented in work from twenty twenty-four on what he called "batch invariance" in LLMs. Even with a fixed seed and temperature zero, you don't actually get guaranteed determinism in practice. Hardware-level floating-point accumulation in GPU kernels can introduce tiny numerical differences that compound over long sequences. So determinism in LLMs is better described as an asymptotic goal than a hard guarantee. You get very high reproducibility, but not perfect reproducibility, especially in long-form outputs like multi-turn wargame narratives.
So even your "fixed" run has a little chaos baked in at the hardware level. That's uncomfortable.
It is. Which is part of why the framing of "deterministic versus stochastic" is a bit of a false binary. What you're really choosing between is high-reproducibility baseline runs and deliberate stochastic ensembles. And the stochastic ensemble is where Monte Carlo methodology comes in. You run the same scenario, same prompt, same actors, but with temperature above zero, and you run it many times. What you get back isn't a single outcome. It's a distribution of outcomes.
And that distribution is doing real work. Because a single run might give you a peaceful resolution, and you could walk away thinking the scenario is stable. But if you run it fifty times and find that peaceful resolution in forty-two runs, escalation to limited kinetic exchange in six, and full-scale conflict in two, that two percent tail is suddenly very relevant.
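A minimal sketch of that ensemble loop, assuming a hypothetical `run_scenario` function standing in for a full multi-turn LLM simulation (here it fakes the outcome draw so the tallying logic is runnable):

```python
import random
from collections import Counter

def run_scenario(seed: int) -> str:
    """Hypothetical stand-in for one full stochastic wargame run.

    A real implementation would drive a multi-turn LLM simulation at
    temperature > 0; here we fake the outcome draw with assumed weights
    so the ensemble tallying is runnable."""
    rng = random.Random(seed)
    return rng.choices(
        ["peaceful_resolution", "limited_escalation", "full_conflict"],
        weights=[0.84, 0.12, 0.04],
    )[0]

def ensemble(n_runs: int) -> Counter:
    """Run the scenario n_runs times and tally the outcome distribution."""
    return Counter(run_scenario(seed) for seed in range(n_runs))

counts = ensemble(50)
# The product of the ensemble is a distribution, not a single outcome.
for outcome, k in counts.most_common():
    print(f"{outcome}: {k}/50 ({k / 50:.0%})")
```

The point of the sketch is the return type: what comes back is a `Counter` over outcomes, not a single narrative.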
That's the core value proposition. Monte Carlo wargaming lets you surface tail risks that a deterministic run structurally cannot show you. Rivera et al. from Stanford HAI in twenty twenty-four, in their work on escalation risks from LLMs in military and diplomatic contexts, found that single-run simulations systematically underestimated escalation risk precisely because the most probable output, which is what a low-temperature run produces, tends to be the most conventional, most "safe" resolution. The model is trained on text where most diplomatic crises don't result in war, so the highest-probability completion is usually some form of negotiated outcome.
Which is a fascinating and slightly alarming property. The model's prior is essentially "history usually doesn't end in catastrophe," so low-temperature runs are optimistic by default.
Structurally optimistic, yes. And that's a real problem if you're using the simulation to stress-test policy assumptions.
Okay, so let's talk about the knobs themselves, because I think this is where a lot of practitioners get into trouble. Temperature is the one everyone knows, but there's more to it than "higher equals more random."
Temperature is controlling the shape of the probability distribution over the next token. At temperature zero, you're doing pure greedy decoding, always picking the highest-probability token. As you increase temperature, you're flattening that distribution, making lower-probability tokens relatively more competitive. At temperature one, you're sampling proportionally to the raw probabilities. Above one, you start making the distribution more uniform than the model's actual beliefs, which is where things get genuinely erratic.
And in a wargaming context, that translates directly to actor behavior.
Very directly. Rivera et al. documented this explicitly. At low temperatures, a persona prompted as a "dovish" diplomatic actor behaves consistently dovishly. At temperatures around zero-point-nine to one-point-two, that same persona will occasionally produce hawkish responses, not because the underlying prompt changed, but because token sampling noise pushed the completion in an unexpected direction. The persona becomes unstable. What you modeled as a cautious foreign minister occasionally acts like a risk-tolerant hawk, purely as a mathematical artifact.
Which, on one hand, you could argue that's realistic, leaders do sometimes act out of character. But on the other hand, if that behavior is driven by sampling noise rather than anything in the scenario design, you haven't learned anything about the scenario.
That's the crux of the whole honest problem we'll get to later. But yes, the practical implication for practitioners is that temperature is not just a creativity dial. In wargaming, it's effectively a risk-tolerance dial for your agents. Low temperature gives you cautious, conventional actors. High temperature gives you unpredictable, occasionally irrational actors. And the question of which is more "realistic" is actually a substantive modeling choice, not a technical default.
What about top-p? Because I feel like that one gets less attention but it's doing something distinct.
Top-p, or nucleus sampling, constrains the pool of tokens the model can sample from. Instead of considering all possible next tokens, you limit sampling to the smallest set of tokens whose cumulative probability exceeds the threshold p. So at top-p of zero-point-nine, you're only sampling from tokens that together account for ninety percent of the probability mass.
So it's cutting off the very long tail of improbable completions.
Cutting off the extreme tail, yes. The interaction with temperature is important here. High temperature flattens the distribution, which means more tokens make it into the nucleus at a given top-p value. So high temperature plus high top-p is a double dose of unpredictability. In wargaming terms, that's an actor who might produce genuinely novel strategic gambits, things that are statistically unlikely but linguistically possible within their persona. Whether that's a feature or a bug depends on what you're trying to learn.
And there's a case for it, right? Real strategic actors sometimes do genuinely surprising things. The question is whether the LLM's version of "surprising" maps onto anything meaningful.
Which loops back to the fundamental methodological question. Let's get into the rigor tiers, because I think this is where practitioners most need concrete guidance and where the field has actually made some progress.
Walk me through them.
There's a framework that's emerged in the literature, codified fairly explicitly in Verma et al.'s twenty twenty-five paper on Monte Carlo sampling frameworks for reliable LLM evaluation. Three tiers, each with different run count requirements, different statistical expectations, and different appropriate claims.
Start at the bottom.
Tier one is what you might call the baseline or hobbyist tier. Ten to thirty runs. The methodology here is simple: look at the mode of outcomes, maybe compute a rough mean if your outcomes are numerical, and ask whether results are clustering. This is a vibe check in the most literal sense. If you run a Taiwan Strait scenario thirty times and blue wins twenty-four times, you have a rough signal that blue has a structural advantage in how the scenario is modeled. You cannot make strong statistical claims. You cannot report confidence intervals with any integrity. But you can say "the outcomes cluster around X, with occasional deviation toward Y."
Thirty runs feels like a lot until you realize it's actually not that many given the variance in LLM outputs.
It's genuinely not many. At thirty runs you're probably still seeing sampling artifacts that would wash out with more data. But for a hobbyist project, or for initial exploratory work before committing compute to a serious experiment, thirty runs is the minimum viable standard. Below ten, you're essentially doing anecdote collection, not simulation.
What does tier two look like?
Tier two is the serious experiment tier. This is what a published paper or a credible think-tank report would need. The minimum here is around one hundred runs, and the literature, specifically that twenty twenty-four study on LLM-based wargaming for policy-relevant conclusions, puts one hundred as the floor for making any claim that's meant to inform real decisions. But the methodology changes significantly. You're not just reporting the mode. You're computing variance, reporting confidence intervals, and running sensitivity analysis.
What does sensitivity analysis look like in this context?
You're systematically varying your key parameters and measuring how much the outcome distribution shifts. So you might run one hundred simulations with your baseline persona prompts, then another hundred with slightly modified persona descriptions, say, making your diplomatic actor slightly more risk-tolerant, and then you measure whether the outcome distribution shifts significantly. If a small change in persona description dramatically changes your escalation rate, that's a signal that your results are sensitive to prompt engineering choices, which is a major validity concern.
Because if your result is highly sensitive to exactly how you worded the persona prompt, that's not a robust finding about the scenario. It's a finding about your prompt.
And that distinction matters enormously for how you present results. A think-tank report that doesn't include sensitivity analysis on persona prompts is essentially hiding a major source of variance. The Cecere et al. work from Amazon Science in twenty twenty-five introduced what they call Monte Carlo Temperature sampling specifically to address this, essentially running ensembles not just over different random seeds but over different temperature settings, to map how sensitive conclusions are to that parameter.
That's clever. Instead of just running one hundred simulations at temperature zero-point-seven, you run batches at zero-point-five, zero-point-seven, and zero-point-nine, and see how the distribution shifts.
And if your key finding, say, "blue achieves its objectives in seventy percent of scenarios," only holds at one specific temperature setting, that's a red flag. A robust result should be relatively stable across a reasonable range of temperature values. If it evaporates when you change temperature from zero-point-seven to zero-point-eight, you probably don't have a finding about the scenario. You have a finding about the model's behavior at that specific temperature.
Okay, tier three. Money is no object.
Tier three is one thousand to ten thousand runs. And this is where you're doing something qualitatively different from the lower tiers. You're not just trying to get stable estimates of the central tendency. You're mapping the full distribution, including the deep tail. At one thousand runs, you can start making meaningful claims about events that occur with five percent frequency or lower. At ten thousand, you're looking for the zero-point-one percent failure modes, the genuine black swans in the scenario design.
What kind of infrastructure does that require? Because ten thousand LLM calls for a multi-turn wargame scenario is not trivial.
It's substantial. A multi-turn scenario with several actors might involve fifty to a hundred LLM calls per run. At ten thousand runs, you're talking five hundred thousand to a million API calls. Even at current API pricing, that's a significant compute budget. Which is why this tier is genuinely reserved for situations where the stakes justify it. A government defense agency stress-testing nuclear escalation scenarios, for instance, where finding that zero-point-one percent tail risk could be worth considerable investment.
And there's a ceiling on this, right? Because the intuition is "more runs, more better," but that's not actually true past a certain point.
Diminishing returns kick in hard around five hundred to one thousand runs for most standard strategic scenarios. And this is where the methodology gets philosophically interesting. Beyond roughly one thousand runs, you're not learning more about the scenario. You're learning more about the model's internal linguistic quirks. You're mapping the LLM's probability space with increasing precision, but that probability space is a property of the model's training data and architecture, not a property of the real-world scenario you're trying to simulate.
So there's a ceiling imposed by the simulation itself, not just by statistics.
The simulation's inherent ceiling, yes. The Lamparth et al. work on LLMs for open-ended wargames, from twenty twenty-four, makes this point explicitly. At some point you've characterized the model's behavior exhaustively, and running more simulations just gives you a more precise map of a territory that was already fully explored. The question you should be asking is whether that territory corresponds to anything real, and more runs don't help you answer that.
Which is a perfect segue into the reporting question. Because even if you've done everything right methodologically, how you present the results matters enormously for whether they get misused.
This is where I think the field has the most room for improvement. The temptation when you have a large ensemble is to present summary statistics as if they're predictions. "The simulation showed a twenty percent chance of escalation to kinetic conflict." That sounds authoritative. It sounds like a probability estimate. But it's not, and presenting it that way is a form of overclaiming that can genuinely mislead decision-makers.
What's the responsible framing?
A few principles. First, report distributions, not point estimates. Instead of "the AI chose peace," you say "in seventy-two percent of simulations, the model's output reflected diplomatic resolution; in twenty-eight percent, it reflected escalation to kinetic force." That framing preserves the distributional nature of the result.
And it signals to the reader that this is a range of outcomes, not a single prediction.
Second, always disclose your temperature settings, your top-p settings, your seed values for any deterministic baselines, and your run count. These are the methodological parameters that any sophisticated reader needs to evaluate your results. A wargame report that doesn't include these is not reproducible and arguably not scientific.
It's the equivalent of a chemistry paper that doesn't report reaction conditions.
Third, and this is the hardest one to implement because it requires disciplined language, avoid anthropomorphic certainty. Instead of "the AI decided to escalate," you say "the model's probability space favors escalation under these conditions." The distinction matters because "the AI decided" implies agency, intentionality, and by extension, predictive validity about what real actors might do. "The model's probability space" correctly locates the finding inside the model's linguistic behavior.
That's a subtle but important distinction. Because decision-makers reading these reports may not have the technical background to distinguish between "the simulation showed X" and "X is likely."
And the risk of conflation is high precisely because the outputs look so plausible. LLMs produce fluent, coherent, contextually appropriate text. A wargame narrative where the Chinese foreign minister makes a specific diplomatic gambit reads like something a real foreign minister might say. That surface plausibility is seductive. It's very easy to forget that you're reading the most probable linguistic completion given a prompt, not a forecast of actual behavior.
Okay, let's get into the honest problem directly, because I think this is the most important thing in this whole episode and also the most underappreciated.
The honest problem is the conflation of two completely different kinds of uncertainty. LLM stochasticity and geopolitical uncertainty. They look similar from the outside, both produce variance in outcomes, but they have completely different sources and completely different implications.
Break down the distinction.
LLM variance, the spread you see across your Monte Carlo ensemble, is caused by the model's sampling process. When the model generates a token, it's sampling from a probability distribution over its vocabulary. If "escalate" has a thirty percent probability and "negotiate" has a fifty percent probability and "withdraw" has a twenty percent probability, the model will sometimes output each of those. The variance in your simulation outcomes is a direct reflection of those probability distributions, which are themselves a reflection of patterns in the model's training data.
So the "uncertainty" is really uncertainty about which token the model will sample next.
It's mathematical uncertainty about a sampling process. Geopolitical uncertainty is something categorically different. It's uncertainty arising from incomplete intelligence, from the fog of war, from the genuine unpredictability of human psychology under pressure, from information that doesn't exist in any training dataset because it hasn't happened yet. A real foreign minister making a real decision under real pressure is operating in a world of genuine uncertainty that no training corpus can fully capture.
So when your simulation shows a twenty percent escalation rate across one hundred runs, that twenty percent is not a probability estimate about the real world. It's a frequency count of how often the model's sampling process landed on "escalation" given your prompt.
And those are very different things. The twenty percent could be inflated because the model's training data over-represents escalatory historical examples. It could be deflated because the model has learned a "diplomatic resolution" prior from the overwhelming majority of non-escalatory diplomatic history in its training data. It could be sensitive to specific word choices in your persona prompts in ways that have no real-world correlate. The number is real, but what it means about the actual scenario is genuinely unclear.
Which doesn't mean the simulation is useless. But it means you have to be very careful about the claims you make.
Exactly the right framing. The simulation is a tool for exploring the model's probability space, which can be informative about how the scenario is structured, what kinds of outcomes are consistent with the model's understanding of the actors involved, and what the sensitivity of outcomes is to different parameter choices. That's useful. But it's a different kind of useful than "this simulation predicts a twenty percent chance of war."
And I think there's a temptation, especially in policy contexts, to want the stronger claim. Decision-makers want numbers they can act on. "Twenty percent chance of escalation" feels actionable. "The model's probability space reflects escalatory completions twenty percent of the time under these prompt conditions" feels like a hedge.
It is a hedge, and it's the right hedge. The field needs to resist the pressure to overclaim, because the cost of overclaiming in policy contexts is potentially very high. If a decision-maker acts on a simulation-derived probability estimate as if it were a calibrated real-world forecast, and the simulation's twenty percent is actually a sampling artifact of how the persona prompt was worded, that's a serious methodological failure with real-world consequences.
The Rivera et al. work from Stanford HAI is pretty direct about this, right? They found that LLMs in military simulations showed escalatory behavior patterns that didn't necessarily map onto anything about real strategic actors.
Their findings were striking. Even when personas were carefully designed to represent dovish, de-escalatory actors, higher temperature settings produced hawkish behavior, not because the persona design was wrong, but because sampling noise at higher temperatures destabilizes persona consistency. And the implication they drew, which I think is important, is that any simulation result produced at high temperature has to be interpreted with significant caution, because you can't cleanly separate "the scenario produces this outcome" from "the temperature setting produced this outcome."
So let's pull this together into practical takeaways. If you're a practitioner building LLM wargames, what are the concrete things you should actually do?
First thing: decide upfront whether you're running a deterministic baseline or a stochastic ensemble, and be clear about which you're doing and why. Use deterministic runs with fixed seeds and low temperature for debugging, prompt testing, and producing reproducible audit trails. Use stochastic ensembles for anything where you want to characterize the distribution of outcomes. Don't mix the two without being explicit about it.
And document your seeds. This sounds obvious but apparently it's not universal practice.
Not universal at all. If you don't record the seed values for your deterministic runs, your results are not reproducible, period. And if you're running stochastic ensembles, document your temperature, your top-p, your run count, and ideally your API version, because model updates can shift the probability distributions in ways that change your results.
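Recording those parameters can be as simple as appending a JSON line per run or per ensemble. A minimal sketch; the field names and the model identifier are illustrative, not a real API:

```python
import json
import time

def record_run_manifest(path, **params):
    """Append a JSON line recording everything needed to reproduce a run.

    Field names are illustrative; the point is that seed, sampling
    parameters, run count, and model version are all captured together."""
    manifest = {"timestamp": time.time(), **params}
    with open(path, "a") as f:
        f.write(json.dumps(manifest) + "\n")

record_run_manifest(
    "runs.jsonl",
    model="example-model-2025-01",  # hypothetical model/version identifier
    seed=42,
    temperature=0.7,
    top_p=0.9,
    n_runs=100,
)
```

An append-only JSONL file is a deliberately low-tech choice: it survives crashes mid-ensemble and doubles as the audit trail for the report's methods section.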
Second takeaway?
Run count discipline. For exploratory work, thirty runs is your floor. For anything you're going to present externally or use to inform decisions, you need a minimum of one hundred runs with variance reporting. And when you report results, report the full distribution, not just the modal outcome. "Outcomes clustered around peaceful resolution in sixty-eight percent of runs, with limited escalation in twenty-four percent and full escalation in eight percent" is a responsible result. "The simulation showed peace" is not.
Third?
Sensitivity analysis is non-optional for serious work. Systematically vary your temperature settings across a reasonable range, maybe zero-point-five to zero-point-nine for most scenarios, and report how much your key outcomes shift. If your finding is robust across that range, you have more confidence it reflects something about the scenario structure rather than a temperature artifact. If it evaporates when you move from zero-point-seven to zero-point-eight, flag that prominently.
And the language discipline point.
The language discipline is the hardest one to maintain under deadline pressure, but it matters most for policy impact. Always locate findings inside the model's behavior rather than projecting them onto real-world actors. "The model's outputs suggest..." rather than "the simulation predicts..." And never, ever present a simulation-derived percentage as a probability estimate without extensive qualification about what that percentage actually represents.
There's also an open question about where this is all going. Because the stochasticity problem is partly a function of current transformer architectures and sampling methods. If future architectures move toward more deterministic generation, or if we develop better ways to calibrate LLM uncertainty against real-world outcomes, the whole methodology might shift.
The Cecere et al. work on Monte Carlo Temperature sampling is an early step in that direction. They're essentially treating the temperature parameter itself as a random variable and sampling over it, which gives you a richer characterization of the model's uncertainty. But the fundamental gap between model variance and real-world uncertainty isn't going away with architectural improvements. Even a perfectly calibrated LLM that accurately represents its own uncertainty is still making predictions based on training data patterns, not on access to the actual state of the world.
So the epistemological problem persists even as the technical tools improve.
It persists, and I think that's actually healthy. The right response isn't to develop better tools that let you overclaim more confidently. It's to develop better methodological norms that help practitioners and consumers of these simulations understand exactly what they're looking at. Monte Carlo wargaming is a genuinely useful tool. It can surface tail risks, stress-test scenario designs, and reveal how sensitive outcomes are to modeling assumptions. That's real value. But it's a tool for exploring a model's probability space, not a crystal ball.
And as compute costs drop, which they will, the temptation to run huge ensembles and present the results as authoritative forecasts is only going to increase. So getting the methodological norms right now, while the field is still being established, matters a lot.
The Lamparth work on open-ended wargames makes the point that the field is at an inflection point. These tools are moving from research curiosities to policy-adjacent applications faster than the methodological standards are developing. Which is a familiar story in technology, but it's a particularly high-stakes version of it when the application domain is military and diplomatic decision-making.
Alright. Big picture summary before we close: the core tension in LLM wargaming is that you need stochasticity to surface real distributional information about your scenario, but that stochasticity is sampling noise, not geopolitical signal. Monte Carlo methodology gives you the tools to characterize that distribution rigorously. But rigor means being honest about what the distribution actually represents.
And the three-tier framework gives practitioners a concrete scaffold. Thirty runs to see if outcomes cluster. One hundred to make any external-facing claim with variance reporting. A thousand or more if you're specifically hunting for tail risks and have the compute budget to justify it. Beyond that, you're mapping the model, not the scenario.
The honest problem isn't a reason not to do these simulations. It's a reason to do them carefully and report them honestly.
Which is, when you think about it, the same standard we'd apply to any empirical methodology. The tool is only as good as the rigor you bring to it.
Thanks as always to our producer Hilbert Flumingtop for putting this together. Big thanks to Modal for the GPU credits that keep this whole operation running. This has been My Weird Prompts. If you want to get notified when new episodes drop, search for My Weird Prompts on Telegram. Until next time.
Take care everyone.