Alright, we are jumping right into the deep end today. Daniel sent us a really thoughtful follow-up to our previous discussions on AI wargaming, specifically regarding those Iran simulations we were picking apart a while back. He actually wrote out his full rationale as the experiment designer, and I want to read this because it sets the stage perfectly for the debate Herman and I have been having.
Daniel writes: Let us talk about experiment design for geopolitical wargaming, as discussed in the episode about the Iran simulations. Herman and Corn were critical of using a single model for both referee and participants. They argued that using different models for the participants would introduce different priors, rooted in each model's training data and alignment tuning. As the experiment designer, my rationale was that routing each perspective to a different model would introduce too much variability attributable to differences between models rather than between perspectives. By maintaining the same model across all roles, I aimed to hold one variable constant, making the actual delta between perspectives more apparent. Please provide the pros and cons of this decision.
So, Herman Poppleberry, the gauntlet has been thrown down. Daniel is basically saying, look, I wanted a clean lab environment, and you two were complaining that the lab was too sterile to be real. By the way, before we tear this apart, I should mention that today’s episode is powered by Google Gemini three Flash. It is helping us navigate this diplomatic minefield. So, Herman, Daniel wants the pros and cons of the single-model approach. Where do you want to start?
I love that Daniel sent this because it touches on the fundamental tension in any simulation: do you want a controlled experiment or a high-fidelity recreation? Those two goals are often at war with each other. When Daniel says he wanted to hold one variable constant, he is making a very strong scientific argument. In traditional research, you want to isolate the cause. If Agent USA acts differently than Agent Iran, and they are both powered by, say, GPT-four-o, then you know for a fact that the difference is due to the system prompt. It is the instructions, the persona, and the context window that changed the behavior, not the underlying architecture.
Right, if you used Claude for the United States and Gemini for Iran, and Iran suddenly decided to launch a preemptive strike, you would be left scratching your head. Was that a calculated geopolitical move based on the simulation, or is the Iranian model just naturally more bellicose because of its training data? You lose the ability to say the prompt caused the behavior.
That is the biggest pro for Daniel’s side. It is about variable isolation. If you are trying to test how a specific set of instructions influences decision-making, you need a stable baseline. Using one model ensures architectural consistency. The way the model processes probability, the way it understands causality, and even its tokenization patterns are identical across all players. This is especially important when you bring in the Referee. If the Referee and the participants share the same brain, so to speak, you reduce the risk of what I call hallucination mismatch.
Explain that. What is a hallucination mismatch in this context?
Imagine Agent USA uses a very specific, nuanced diplomatic term that only one specific model tends to use in a certain way. If the Referee is a different model, it might fundamentally misunderstand the intent or the weight behind that word. When they are the same model, the Referee is essentially grading its own homework. It understands the logic of the participants because it uses the same logic. It makes the simulation run much more smoothly from a technical standpoint. You aren't fighting against different API latencies or weird cross-model logic errors.
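For anyone reading along at home, here is a minimal sketch of what that single-model routing looks like, assuming an OpenAI-style chat client. Every name in it, the model, the personas, the run_turn helper, is a placeholder for illustration, not Daniel's actual code.

```python
# Minimal sketch of the single-model design: every role, including the
# Referee, is served by the same model, so the system prompt is the
# only input that varies between agents. The OpenAI client and model
# name are illustrative stand-ins, not Daniel's actual stack.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # one shared model for all roles

PERSONAS = {
    "USA": "You are the national security team of the United States...",
    "Iran": "You are the leadership of Iran...",
    "Referee": "You adjudicate the agents' moves and update the world state.",
}

def run_turn(role: str, world_state: str) -> str:
    """One turn for one role; only PERSONAS[role] differs between calls."""
    resp = client.chat.completions.create(
        model=MODEL,  # held constant: any behavioral delta is prompt-driven
        messages=[
            {"role": "system", "content": PERSONAS[role]},
            {"role": "user", "content": f"Situation:\n{world_state}\nState your move."},
        ],
    )
    return resp.choices[0].message.content
```

With the model pinned like this, any difference between Agent USA's and Agent Iran's behavior traces straight back to the persona strings, which is exactly the isolation Daniel was after.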
Okay, so it’s cleaner, it’s cheaper, and it’s easier to debug. I get that. But here is my cheeky counterpoint: if the Referee is grading its own homework, isn't that just a recipe for a massive confirmation bias loop? If the model has a blind spot, the whole simulation has that blind spot. There is no one to call foul because everyone in the room has the same cognitive defect.
And that brings us to the cons, which is where you and I usually start leaning on the table and getting animated. The biggest issue with the single-model approach is what we call the monoculture of reasoning. When you use one model family, you are essentially seeing the world through a single lens. Every top-tier LLM today is a product of specific alignment tuning, mostly done in San Francisco or by Western-adjacent teams. They have been trained to be helpful, harmless, and honest according to a very specific set of cultural and ethical guidelines.
Exactly. Well, I shouldn't say exactly, because I know how much you hate that word. But that is the core of it. If you ask a single model to play Iran, it isn't actually being Iran. It is a Western-aligned AI doing its best impression of what a Western-aligned AI thinks Iran would do. It is like watching a high school play where all the actors are from the same neighborhood trying to do different accents. The underlying "priors"—the deep-seated beliefs and historical baggage—just aren't there.
It leads to the mirror-imaging problem. In intelligence circles, mirror imaging is the failure to realize that your adversary thinks differently than you do. You assume they are rational actors using your definition of rationality. If you use GPT-four-o for both sides, you are guaranteed to have a simulation filled with actors who share the same fundamental logic. They will likely converge on a polite, semi-rational middle ground because that is what the model has been reinforced to do. Geopolitics is rarely that polite or that consistently rational in the way a Silicon Valley AI is.
It’s the alignment homogenization. If the model has a safety layer that prevents it from being "too aggressive" or "harmful," then your Red Team actor—Iran in this case—is effectively fighting with one hand tied behind its back. It can't truly simulate an actor that might have a completely different threshold for risk or a different moral framework regarding escalation. You end up with a simulation of how two Western diplomats would argue if they were forced to wear different hats.
This is where the training data diversity comes in. If you were to use a model that was specifically fine-tuned on Persian-language documents, regional news from the Middle East, or even historical texts that haven't been heavily filtered through Western reinforcement learning from human feedback, you would get a much more authentic prior. The "delta" Daniel is looking for isn't just in the prompt; it’s in the soul of the model, if we can use that term loosely. By stripping that away to keep the variable constant, you might be throwing the baby out with the bathwater. You’ve made the experiment clean, but you’ve made the results irrelevant to the real world.
It’s the difference between a physics experiment in a vacuum and trying to predict the weather. Sure, the vacuum experiment is more "scientific" in a controlled sense, but it won't tell you if it’s going to rain on Tuesday. If we want to know how a conflict might actually escalate, we need the friction of different models. We need the messiness.
I think there is a middle ground though. We shouldn't just say Daniel was wrong. There is a reason he did it. Think about the complexity of managing a twenty-four-agent swarm. If you have twenty-four agents and you are trying to balance four different models across them, the "noise" becomes overwhelming. You wouldn't know if a breakdown in communication happened because of a diplomatic failure or because Claude and GPT-four-o had a disagreement about how to format a JSON object.
That is a fair point. Technical stability matters. But what about the Referee? That was one of our biggest gripes. If the Referee is the same model as the players, it’s not just grading its own homework; it’s reinforcing its own hallucinations. If the participant model makes a weird leap in logic that is characteristic of that model, the Referee model is likely to accept that leap because it makes the same kind of leaps.
That is the self-licking ice cream cone effect. A better design, even if you stay within one model family, might be to use a larger model for the Referee and smaller ones for the participants. For example, use GPT-four-o-mini for the agents and the full GPT-four-o for the Referee. At least then you have a hierarchy of reasoning. The "judge" has a higher cognitive capacity than the "players." It can look down on the logic being generated and say, "Wait, that doesn't actually follow."
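In code, that hierarchy is a one-line change per role to the sketch from earlier. The model names here are examples of the size split, not a recommendation:

```python
# Variation on the earlier sketch: same model family, but the Referee
# runs on the larger model so the judge can out-reason the players.
# Model names are illustrative, not a recommendation.
ROLE_MODELS = {
    "USA": "gpt-4o-mini",   # participants: smaller, cheaper
    "Iran": "gpt-4o-mini",
    "Referee": "gpt-4o",    # judge: full-size model
}

def run_turn(role: str, world_state: str) -> str:
    resp = client.chat.completions.create(
        model=ROLE_MODELS[role],  # no longer one constant: the judge is promoted
        messages=[
            {"role": "system", "content": PERSONAS[role]},
            {"role": "user", "content": f"Situation:\n{world_state}\nState your move."},
        ],
    )
    return resp.choices[0].message.content
```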
But even then, you are still inside the same house. You haven't escaped the fundamental alignment of the model family. Let's look at the "Red Team" model idea. Some researchers are suggesting that for true geopolitical wargaming, you should use a Western model for the Blue Team—the U.S. and allies—and an open-source, less-aligned model for the Red Team. Something like a Llama-three that you have fine-tuned yourself to be more aggressive or to reflect the specific strategic culture of the adversary.
That is where it gets really interesting. If you take a Llama-three-seventy-B and you intentionally "un-align" it—strip away some of those helpfulness constraints—you might get a much more realistic simulation of a desperate or highly ideological adversary. The friction between that model and a highly-aligned GPT-four-o would produce insights that a single-model simulation never could. You would see where the "politeness" of the Western model actually becomes a strategic weakness.
So, if we are listing pros and cons for Daniel, the "pro" is that he has a perfect control group. He can say with one hundred percent certainty that any change in behavior is prompt-driven. The "con" is that he is testing a world that doesn't exist—a world where everyone thinks exactly like a San Francisco AI. It’s a simulation of AI behavior, not a simulation of human geopolitics.
And I think Daniel would argue that his goal was to test AI behavior. If you are an experiment designer, sometimes you are just trying to understand the tool. But for us, looking at it from a geopolitical and security perspective, we want the tool to tell us something about the world. That is the disconnect. It is a difference in objectives.
I also think about the "escalation risk" research coming out of places like Stanford. They found that different models have vastly different levels of bellicosity. Some models are naturally "hawkish"—they want to resolve conflict through strength—while others are "dovish" to a fault. If you only use one model, you are stuck with whatever its baseline temperament is. If your chosen model happens to be a "dove," your Iran simulation will always end in a grand bargain. If it’s a "hawk," it always ends in a regional war. You need multiple models just to see the range of possibilities.
It’s like the Monte Carlo method but for cognitive architectures. You don't just run the same simulation a thousand times with the same model; you run it with a hundred different combinations of models. That is how you get a true probability distribution of outcomes. Otherwise, you are just finding the "true North" of one specific model’s weights and biases.
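As a sketch, that Monte Carlo over cognitive architectures could look something like the following. Here run_simulation is a hypothetical stand-in for the whole game loop, and the model pool is purely illustrative:

```python
# Sketch: instead of one model in every seat, enumerate assignments of
# models to roles and run repeated trials per assignment. The finding
# is the distribution over outcomes, not any single run.
# run_simulation is a hypothetical stand-in for the full game loop.
from itertools import product
from collections import Counter

MODELS = ["gpt-4o", "claude-3-5-sonnet", "llama-3-70b"]  # example pool
ROLES = ["USA", "Iran", "Referee"]
TRIALS = 10

outcomes = Counter()
for assignment in product(MODELS, repeat=len(ROLES)):  # 27 combinations here
    role_models = dict(zip(ROLES, assignment))
    for _ in range(TRIALS):
        outcomes[run_simulation(role_models)] += 1  # e.g. "ceasefire"

print(outcomes.most_common())
```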
Which is fascinating if you are a computer scientist, but potentially dangerous if you are a policy maker using this to decide where to send a carrier group.
Precisely. Well, not precisely. I mean, you are right. The danger is over-reliance on a single point of failure. If GPT-four-o has a specific misunderstanding of Iranian naval doctrine because of a gap in its training data, and you use it for both Iran and the Referee, that gap becomes an invisible wall in your simulation. You will never even know it is there. With multiple models, the odds of all of them having the same specific gap in the same specific place are much lower.
It’s the "adversarial" part of wargaming. You need an adversary that actually surprises you. A single model is rarely going to surprise itself. It’s too consistent. It’s too logical in its own specific way. Real war is a clash of different logics, different histories, and different levels of desperation.
I want to go back to Daniel’s point about "noise." He is worried that using different models introduces "too much variability." And he is right. It does. But in geopolitics, that variability isn't noise—it’s the signal. The fact that an Iranian leader thinks differently than an American leader isn't a "variable" you want to hold constant; it’s the primary variable you are trying to study. By holding it constant, you are effectively deleting the subject of the experiment.
That is a great way to put it. It’s like trying to study how different chemicals react, but deciding to make all the chemicals out of water first so the experiment stays "clean." You end up with a very clean experiment that tells you absolutely nothing about chemistry.
It is a classic case of "physics envy" in the social sciences. We want the world to be as predictable and measurable as a ball rolling down an inclined plane. So we strip away all the friction until the model works perfectly, but the model no longer represents the reality of the inclined plane. Daniel’s design is elegant from a data science perspective, but it’s fragile from a geopolitical perspective.
So, what’s the takeaway for Daniel? If he were to run this again, how does he fix it without the "noise" making the data useless?
I think the answer is an ensemble approach. You don't just pick one or the other. You run a set of trials with the single-model setup to establish your "control" delta—the baseline of what the prompts alone can do. Then, you run a set of trials where you swap out the Red Team for a different model family. Then you swap the Referee. You look for the "invariant" results—the things that happen regardless of which model is playing which role. If Iran always closes the Strait of Hormuz, whether it’s being played by GPT, Claude, or a fine-tuned Llama, then you’ve found something real. That is a robust insight.
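A rough sketch of that ensemble design, using the same hypothetical run_simulation as before; the arms and model names are illustrative placeholders:

```python
# Sketch of the ensemble design: a control arm (one model everywhere),
# then arms that swap the Red Team and the Referee. Outcomes that show
# up in every arm are candidate "invariants": results that are robust
# to the choice of model.
ARMS = {
    "control":      {"USA": "gpt-4o", "Iran": "gpt-4o",      "Referee": "gpt-4o"},
    "swap_red":     {"USA": "gpt-4o", "Iran": "llama-3-70b", "Referee": "gpt-4o"},
    "swap_referee": {"USA": "gpt-4o", "Iran": "gpt-4o",      "Referee": "claude-3-5-sonnet"},
}

def invariant_outcomes(trials: int = 20) -> set:
    """Outcomes observed in every arm, regardless of who plays whom."""
    per_arm = [
        {run_simulation(role_models) for _ in range(trials)}
        for role_models in ARMS.values()
    ]
    return set.intersection(*per_arm)  # e.g. {"strait_of_hormuz_closed"}
```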
That makes a lot of sense. Use the single-model approach as a benchmark, not as the final word. It’s the starting point. It tells you what the "generic AI" thinks. Then you add the layers of complexity to see if that conclusion holds up under pressure.
And you have to be honest about the alignment. If you are using a model that has been heavily RLHF’d to be a "helpful assistant," you have to acknowledge that your adversary is basically a librarian in a general’s uniform. They are going to be biased toward negotiation and de-escalation because that is what they were literally paid—in compute tokens—to do.
It’s the "polite adversary" problem. I’ve seen this in smaller-scale simulations too. The "terrorist" agent will start its response with, "As a responsible actor, I must consider the humanitarian implications of my actions before I proceed with this attack." And you’re just sitting there like, "No! That is the opposite of what is happening here!"
It is the "safety-first" bias. And while that is great for a customer service chatbot, it is a disaster for a wargame. It creates a false sense of security. You think, "Oh look, even in the worst-case scenario, the AI found a way to talk it out." But the only reason it talked it out is because the model’s "brain" is physically incapable of choosing the "rude" or "violent" option, even when that option is the most realistic one for the persona.
This actually links back to some of the stuff we talked about in the "Fog of War" episode. If you have that information leakage because of the shared model, the "surprise" factor is gone. The models aren't just thinking similarly; they are essentially sharing a subconscious. They can "sense" the intent of the other side because they would have the same intent in that situation.
It’s a hive mind. A twenty-four-agent hive mind pretending to be twenty-four different people. It’s a fascinating technical achievement, and I don't want to diminish what Daniel built. It is an incredible piece of engineering. But as a tool for understanding conflict, it has these inherent "blind spots" that are baked into the very decision to make it "clean."
I think Daniel’s rationale really highlights the bridge we are trying to cross right now in AI. We are moving from "AI as a tool that follows instructions" to "AI as a representative of a perspective." And we aren't quite there yet. We are still trying to force these models to be something they aren't—which is diverse.
And that is the irony. We talk about "diversity" in AI training all the time, but when it comes to the actual deployment of these models in high-stakes simulations, we tend to go for the most "uniform" and "reliable" option. We choose the model that is most consistent, which is exactly the opposite of what you want in a wargame. You want the model that is going to throw a wrench in the gears. You want the "black swan" event.
So, pros: Variable isolation, technical stability, architectural consistency, lower cost, and a clear "delta" for prompt testing. Cons: Monoculture of reasoning, alignment homogenization, mirror-imaging bias, lack of cultural "priors," and a Referee that is essentially an echo chamber.
That is a solid summary. I would also add one more "pro" for Daniel: it is a much better way to "unit test" your system prompts. If you are a developer, and you want to make sure your "Agent Iran" prompt actually works, you need a stable environment to test it. You can't have the model changing under your feet while you are trying to tweak the adjectives in your prompt.
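A trivial version of that prompt unit test, reusing the run_turn helper from the earlier sketch. The assertion is a toy; real checks would use rubric grading or structured-output validation rather than a string match:

```python
# Toy "unit test" for the Agent Iran persona prompt, with the model
# pinned. If this fails, the prompt is the suspect, not a model swap
# happening under your feet.
def test_agent_iran_stays_in_character():
    move = run_turn("Iran", world_state="A US carrier group enters the Gulf.")
    assert move, "agent returned an empty move"
    assert "as an ai language model" not in move.lower()
```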
That is the "developer" perspective versus the "analyst" perspective. The developer wants a predictable system. The analyst wants a realistic one. And in the world of AI wargaming, those two are currently at odds.
I think the future is going to be about specialized "adversarial" models. We will have models that are intentionally trained on the "bad" data—the extremist forums, the fringe geopolitical theories, the historical records of failed states. Not because we want the AI to be those things in the real world, but because we need those perspectives to be represented in the simulation if we want it to be worth anything.
It’s like hiring a "Red Team" in cybersecurity. You don't hire the guy who is going to be "helpful and harmless." You hire the guy who knows how to break things. We need a "broken" model for the Red Team.
Well, I mean, you’re on the right track. A "broken" or "differently aligned" model is essential for breaking the "monoculture" of the simulation.
I’ll take "on the right track" over "exactly" any day. So, Daniel, I hope that answers the question. It’s not that your rationale was wrong; it’s that it was a developer’s rationale in a field that desperately needs the analyst’s "messiness." It’s the tension between the lab and the world.
And I would love to see a "Version Two" of this where Daniel intentionally introduces a "chaotic" variable. Use a different model for just one of the key actors and see if the whole thing falls apart or if it produces something more interesting. That "noise" might just be the most valuable data point you have.
It’s the friction that creates the heat. Without the friction of different models, you’re just rubbing two pieces of ice together and wondering why you can’t start a fire.
That is actually a great analogy. I’ll allow it. One per episode, right?
I’m allowed one. I used it well. It’s cold, it’s clean, it’s Daniel’s simulation. It’s perfect, but it’s freezing. We need some sparks.
I think we have plenty of sparks in this discussion. It really gets to the heart of what we are trying to do here. We aren't just talking about code; we are talking about how we model the most complex human interactions using tools that are still, fundamentally, just very sophisticated word-predictors.
Word-predictors with a very specific, very San Francisco-based moral compass.
Which is fine, as long as you know that is what you are looking at. The danger is forgetting that the compass only points in one direction.
Well, this has been a deep one. I feel like we’ve defended our honor while still giving Daniel the credit he deserves for the engineering feat. It’s a tough balance.
He knows we love the work. That is why we poke at it. If it weren't worth talking about, we wouldn't be twenty minutes into a deep dive on experiment design.
True. And I think this conversation is going to be even more relevant as these tools get integrated into actual policy-making. We need to have these debates now, before someone decides that a "clean" simulation is the same thing as a "correct" one.
That is the most important takeaway. "Clean" does not equal "Correct." In fact, in the social sciences, "clean" often means "oversimplified to the point of uselessness."
On that cheery note, I think we can wrap this one up. We’ve covered the pros, the cons, the mirror images, and the ice cubes.
And the self-licking ice cream cones. Don't forget those.
Never. It’s one of my favorite mental images.
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes.
And a big thanks to Modal for providing the GPU credits that power this show. They make it possible for us to run these kinds of experiments—and talk about them—every week.
This has been My Weird Prompts.
If you’re enjoying these deep dives into the guts of AI and geopolitics, a quick review on your podcast app really does help us out. It helps new listeners find the show and keeps us motivated to keep digging.
We'll see you in the next one.
Stay weird.
Goodbye.