Alright, we've got a good one today. I want to do a proper code review meets design analysis of Snowglobe — that's the open-source LLM wargaming framework from IQT Labs. The repo just got archived on March 18th, which makes this a perfect retrospective moment. We're talking five hundred eighty-one commits, twelve releases, v1.0.0 shipped in September 2025, and then archived just weeks ago. So the question is: what did they actually build, how did they build it, and what engineering decisions are worth stealing?
The timing is genuinely interesting for a retrospective. The codebase is frozen now, which means we can analyze it without the ground shifting under us. And it's worth noting this wasn't a toy project — in April 2025, they ran a real wargame with six human participants using this system, and it got written up in the CIA's Studies in Intelligence journal, Volume 69, Number 4, December 2025. So we're talking about research code that made it to operational deployment within about a year of the original arXiv paper.
By the way, today's episode is powered by Claude Sonnet 4.6 doing the script generation, which feels appropriately on-brand for an episode about LLMs running wargames.
It really does. Okay, so let's start with the agent architecture, because that's the foundation everything else sits on. The class hierarchy is built around two base classes combined through Python multiple inheritance, and the design is cleaner than most research code you see in this space.
Walk me through the two base classes, because I think that split is where a lot of the interesting design thinking lives.
So you have Intelligent and Stateful. Intelligent is the workhorse — every agent that can generate output inherits from it. It holds a reference to an LLM object, a Database, verbosity settings, and a unique I/O ID which is just a random six-digit integer or UUID. The key method is return_output, which routes to one of three backends: return_from_ai, return_from_human, or return_from_preset. That three-way routing is actually one of the more elegant decisions in the codebase.
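If you sketched that three-way routing in Python, it might look like this. Only return_output and the three return_from_* names come from the source; the backend flag, constructor shape, and stand-in bodies are assumptions:

```python
import random

class Intelligent:
    """Sketch of the three-way output routing; only return_output and the
    three return_from_* names come from the source, the rest is assumed."""

    def __init__(self, backend="ai", preset_responses=None):
        self.backend = backend
        self.preset_responses = list(preset_responses or [])
        self.io_id = random.randint(100000, 999999)  # random six-digit I/O ID

    def return_output(self, query):
        # Route to one of three interchangeable backends.
        if self.backend == "ai":
            return self.return_from_ai(query)
        if self.backend == "human":
            return self.return_from_human(query)
        return self.return_from_preset(query)

    def return_from_ai(self, query):
        return f"[LLM response to: {query}]"  # stand-in for a real LLM call

    def return_from_human(self, query):
        return input(f"{query}\n> ")  # console input in this sketch

    def return_from_preset(self, query):
        # Scripted responses: deterministic replays and regression tests.
        return self.preset_responses.pop(0)

agent = Intelligent(backend="preset", preset_responses=["Mobilize the fleet."])
print(agent.return_output("What is your move?"))  # → Mobilize the fleet.
```

The point of the sketch is that the caller never cares which backend is behind return_output, which is exactly what makes the mock-player preset mode cheap to add.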
The preset option is the one that catches my eye. That's essentially a mock player for testing — you can inject scripted responses and get deterministic replays.
Which matters enormously for a research codebase. You can reproduce a specific game run, you can script historical scenarios, you can build regression tests. The Intelligent class also handles prompt construction through return_template, which assembles sections — persona, history, responses, query — and supports four query formats. The most important is twoline, which gives a full question-and-answer format with a persona reminder, so the model is constantly being told who it is before it responds.
And the retry logic in there — that's where it gets a bit alarming. Up to sixty-four retries if the LLM returns empty output?
Sixty-four. Which is a pragmatic hack that tells a story. If you're running a local Mistral-7B model on a gaming laptop, you're going to get empty outputs sometimes. The retry loop is a brute-force workaround. In production at scale, sixty-four retries could add serious latency, but for a research prototype running three to five moves in a wargame, it's probably fine. The stop token is interesting too — they use "Narrator:" as the stop token to prevent runaway generation. So the model literally stops generating when it would start narrating someone else's turn.
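The retry-plus-stop-token behavior could be sketched like this, with generate() standing in for the actual LLM call; the exact truncation logic and function signature are assumptions:

```python
def call_with_retries(generate, prompt, max_retries=64, stop_token="Narrator:"):
    """Brute-force retry on empty output, in the spirit described above;
    generate() is a stand-in for a real LLM call."""
    for attempt in range(max_retries):
        text = generate(prompt)
        # Truncate at the stop token so the model never narrates someone
        # else's turn.
        if stop_token in text:
            text = text.split(stop_token)[0]
        if text.strip():  # non-empty after truncation: done
            return text.strip()
    raise RuntimeError(f"empty output after {max_retries} retries")

# A flaky stand-in backend: empty twice, then a runaway generation.
outputs = iter(["", "", "We hold the border. Narrator: Meanwhile..."])
result = call_with_retries(lambda p: next(outputs), "Your move?")
print(result)  # → We hold the border.
```

A small local model that goes quiet a couple of times in a row just gets asked again, which is crude but effective at this scale.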
Then Stateful is almost comically minimal by comparison.
It's essentially a mixin that gives an agent a History object and two recording methods: record_narration and record_response. That's it. The separation is intentional — Intelligent handles I/O and LLM interaction, Stateful handles game state. Then Control and Player both inherit from both, which is Python's multiple inheritance and method resolution order working as intended.
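A toy version of the split, with a stub Intelligent so you can see how the mixin composes; everything beyond the record_narration and record_response names is assumed:

```python
class History:
    def __init__(self):
        self.entries = []  # ordered (name, text) pairs

class Stateful:
    """The minimal mixin: a History plus two recording methods."""
    def __init__(self):
        self.history = History()

    def record_narration(self, text):
        self.history.entries.append(("Narrator", text))

    def record_response(self, name, text):
        self.history.entries.append((name, text))

class Intelligent:
    """Stub of the I/O half; the real class holds the LLM reference."""
    def respond(self, query):
        return f"[response to {query}]"

class Player(Intelligent, Stateful):
    """Inherits I/O behavior from Intelligent and game state from Stateful."""
    def __init__(self, name):
        Stateful.__init__(self)
        self.name = name

p = Player("Azuristan")
p.record_response(p.name, p.respond("Opening move?"))
print(p.history.entries)  # → [('Azuristan', '[response to Opening move?]')]
```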
And Team is the interesting outlier, because Team only inherits from Stateful and not from Intelligent.
Which is the Composite pattern in action. Team has no LLM of its own. When you call respond on a Team, it iterates over all its members calling member.respond, then calls leader.synthesize to combine the responses. From the Control's perspective, a Team looks exactly like a Player — same interface. That's duck typing doing exactly what it's supposed to do. You could have teams of teams, players on multiple teams, arbitrarily nested hierarchies — the paper explicitly says any acyclic arrangement is allowed.
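Here's a compressed sketch of that Composite, assuming a synthesize method on the leader (the leader.synthesize call is described in the source) and simplified signatures throughout:

```python
class Player:
    def __init__(self, name):
        self.name = name

    def respond(self, query):
        return f"{self.name}: [plan for {query}]"

class Leader(Player):
    def synthesize(self, parts):
        return f"{self.name} synthesizes: " + " | ".join(parts)

class Team:
    """Composite: presents the same respond() interface as Player, no LLM."""
    def __init__(self, name, members, leader):
        self.name, self.members, self.leader = name, members, leader

    def respond(self, query):
        # Gather every member's answer, then have the leader combine them.
        return self.leader.synthesize([m.respond(query) for m in self.members])

cabinet = Team("Cabinet", [Player("Defense"), Player("Treasury")], Leader("PM"))
# Duck typing: a Team nests inside another Team exactly like a Player would.
alliance = Team("Alliance", [cabinet, Player("Envoy")], Leader("Chair"))
print(alliance.respond("Respond to the blockade"))
```

Because Team.respond and Player.respond share a signature, the nesting depth is invisible to whoever calls respond at the top.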
The acyclic constraint is interesting. No cyclic dependencies — so you can't have two teams each waiting on the other's response. That's a reasonable constraint for a research prototype but it's worth flagging as a limitation if you wanted to model something like simultaneous negotiation rounds.
Exactly the kind of scenario where you'd hit the constraint. Okay, let's talk about how scenarios are actually defined, because the YAML schema is where a lot of the design philosophy becomes concrete.
The YAML-driven approach is interesting to me because it creates this clean separation between game logic and game content. A game designer doesn't need to touch Python code to define a new scenario.
The schema has title, scenario — which is a free-form prose block of two hundred to six hundred words — goals as named strings, moves as an integer, timestep as a string, nature as a boolean-ish parameter, and then players and advisors. The goals system is particularly clean: goals are named strings that are reusable across players and advisors, and they're concatenated at runtime onto the persona string. So a player's full identity is literally: "You are the leader of Crimsonia. Your goal is to unify the Crimsonian people, even if it requires starting a war."
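A sketch of how that runtime concatenation might work, using a plain dict in place of the parsed YAML; the field names follow the schema described here, but the helper function and prose values are hypothetical:

```python
# Scenario content roughly as the parsed YAML might look. Field names follow
# the schema described above; the helper below is a hypothetical sketch.
scenario = {
    "title": "Azuristan-Crimsonia",
    "moves": 3,
    "nature": True,
    "goals": {
        "unify": "unify the Crimsonian people, even if it requires starting a war",
    },
    "players": [
        {"name": "Crimsonia",
         "persona": "You are the leader of Crimsonia.",
         "goals": ["unify"]},
    ],
}

def build_persona(player, goals):
    """Concatenate named, reusable goals onto the persona string at runtime."""
    parts = [player["persona"]]
    for key in player["goals"]:
        parts.append(f"Your goal is to {goals[key]}.")
    return " ".join(parts)

print(build_persona(scenario["players"][0], scenario["goals"]))
```

Running that produces exactly the kind of concatenated identity string quoted above, which really is the player's entire formal motivation.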
That's the entire formal representation of a player's motivation — a concatenated prose string. No structured goal representation, no formal conflict detection, no guarantee of coherence. The LLM has to interpret all of it.
Which is simultaneously the system's greatest strength and its most obvious weakness. The strength is that you can represent any goal in any domain without writing domain-specific code. The weakness is that you have no formal guarantees about anything. The LLM might ignore the goals, might interpret them in unexpected ways, might hallucinate entirely new motivations during adjudication.
Speaking of the nature parameter — I want to flag this because it's one of the more interesting subtle bugs or features in the codebase depending on how you look at it. The documentation treats nature as a boolean, but the implementation uses it as a float probability in a random.random() comparison.
So nature equals True evaluates to 1.0 in Python, meaning unexpected consequences always get added to the adjudication prompt. But nature equals 0.5 would add them fifty percent of the time. This is an undocumented feature baked into the type coercion. A game designer who reads the README thinks they're flipping a boolean. A game designer who reads the source code discovers they have a probability dial. That's the kind of thing that emerges organically in research code and either becomes a documented feature or a latent bug depending on whether anyone notices.
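The coercion is easy to demonstrate. The comparison direction here is an assumption, but the True-becomes-1.0 behavior is just Python:

```python
import random

def adjudication_prompt(base, nature):
    """nature=True coerces to 1.0, so the clause is always appended; a float
    like 0.5 becomes a probability dial. The comparison direction is an
    assumption; the coercion itself is standard Python."""
    if random.random() < float(nature):
        return base + " Include unexpected consequences."
    return base

# True → always appended, because random.random() lives in [0.0, 1.0).
assert "unexpected" in adjudication_prompt("Adjudicate the move.", True)
# 0.0 → never appended.
assert "unexpected" not in adjudication_prompt("Adjudicate the move.", 0.0)
print(float(True), float(False))  # → 1.0 0.0
```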
Let's get into the orchestration layer, because the turn management design is where I have some opinions.
The game loop is a Python for loop. It counts moves from zero to self.moves, iterates over the players, calls player.respond on each, collects the responses into a History object, calls adjudicate, and records the narration. No state machine, no event queue, no complex turn engine.
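As pseudocode-made-runnable, with stub responses and a stub adjudicator in place of LLM calls, the loop is roughly:

```python
class SketchGame:
    """Bare-bones version of the loop described above; respond and
    adjudicate logic are stand-ins for LLM calls."""
    def __init__(self, players, moves):
        self.players, self.moves = players, moves
        self.history = []  # ordered (name, text) pairs standing in for History

    def adjudicate(self, responses):
        # The real adjudicator is an LLM prompt; here it's a join.
        return "Narration: " + "; ".join(text for _, text in responses)

    def run(self):
        for move in range(self.moves):
            responses = [(p, f"{p} acts on move {move}") for p in self.players]
            self.history.extend(responses)
            self.history.append(("Narrator", self.adjudicate(responses)))

game = SketchGame(["Azuristan", "Crimsonia"], moves=2)
game.run()
print(len(game.history))  # 2 moves × (2 responses + 1 narration) → 6
```

That really is the whole control flow: the intelligence lives in the prompts, not in the loop.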
Which is almost aggressively simple for something being used in national security contexts.
And I think that simplicity is a deliberate choice that deserves credit. The alternative would be something like a formal game engine with discrete state transitions, action validation, consistency checking — and the paper's explicit thesis is that LLMs make all of that unnecessary for qualitative games. The LLM is the game engine. The prose history is the game state.
The History class is doing a lot of heavy lifting then. Walk me through how it actually works.
History is an ordered list of name-text pairs. But the interesting implementation detail is that it supports lazy async text — entries can hold coroutines, resolved later via asyncio.gather. It supports slicing and indexing, so history[-1] gives you the last entry wrapped in a new History object. It has two serialization modes: one that replaces the player's own name with "You" for first-person prompts, and one that strips names entirely. And it has a copy method that enables information asymmetry — you can give different players different views of the game history.
That information asymmetry capability is listed in the paper as a feature, but the implementation is basically just... pass a different slice of the history to each player.
There's no formal mechanism enforcing it. No access control, no cryptographic isolation, no audit trail. If a game designer forgets to slice the history correctly, players get full information. It works, but it works because the game designer is careful, not because the system enforces it.
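A stripped-down History illustrating the slicing, the first-person serialization, and the copy-based asymmetry, with all method signatures assumed:

```python
import copy

class History:
    """Sketch of the History container described above: an ordered list of
    (name, text) pairs. Method signatures are assumptions."""
    def __init__(self, entries=None):
        self.entries = list(entries or [])

    def __getitem__(self, key):
        # Slicing or indexing returns a new History, as described above.
        sliced = self.entries[key]
        return History(sliced if isinstance(sliced, list) else [sliced])

    def serialize(self, viewer=None):
        # First-person mode: replace the viewer's own name with "You".
        lines = []
        for name, text in self.entries:
            shown = "You" if name == viewer else name
            lines.append(f"{shown}: {text}")
        return "\n".join(lines)

    def copy(self):
        # An independent view: the basis for information asymmetry, enforced
        # only by the game designer's discipline, not by the system.
        return copy.deepcopy(self)

h = History([("Azuristan", "We mobilize."), ("Narrator", "Tensions rise.")])
print(h[-1].serialize())                # → Narrator: Tensions rise.
print(h.serialize(viewer="Azuristan"))  # first line → You: We mobilize.
```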
The async architecture for handling human and AI players simultaneously is the part I find most architecturally interesting. The paper has this line: "AI is compute-bound and humans are I/O bound." That's a clean framing.
And the implementation follows directly from it. In UserDefinedGame, advisor chat sessions run concurrently with the main game loop using asyncio.TaskGroup. AI players respond sequentially — there's no parallel AI inference, which is an explicit choice to avoid hardware overload on the target deployment environment of a gaming laptop. Human players are detected via watchfiles.awatch, which is a filesystem watcher that blocks until the SQLite database file changes.
Using a database as a message bus with filesystem events as the notification mechanism is... not the first design I would reach for.
It's unusual, but the paper describes it as "future-proofing" — the API is designed so new user interfaces can be built without touching simulation code. The FastAPI endpoint writes to SQLite, the filesystem watcher fires, the game loop continues. It's decoupled in the right direction even if the mechanism is unconventional. And for a system that needs to run air-gapped on classified networks, avoiding WebSocket dependencies or cloud notification services has real practical value.
Let's talk about the design patterns, because there are several of them working together and I want to understand the rationale for each.
The Template Method pattern is the primary extension point. Control's __call__ method raises NotImplementedError — it's abstract and must be overridden in subclasses. Every game is a subclass of Control that overrides __call__. The run method wraps asyncio.run of self() for a clean synchronous entry point. This is a classic pattern for framework design: the framework defines the skeleton of the algorithm, subclasses fill in the specifics.
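In miniature, assuming the abstract method is __call__ (the run-wraps-asyncio.run-of-self() detail implies it):

```python
import asyncio

class Control:
    """Template Method skeleton: __call__ is abstract, run() wraps
    asyncio.run for a clean synchronous entry point."""
    async def __call__(self):
        raise NotImplementedError  # every game subclass must override this

    def run(self):
        return asyncio.run(self())

class TwoMoveGame(Control):
    """A trivial subclass filling in the specifics."""
    async def __call__(self):
        narration = []
        for move in range(2):
            narration.append(f"move {move} resolved")
        return narration

print(TwoMoveGame().run())  # → ['move 0 resolved', 'move 1 resolved']
```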
The Strategy pattern for LLM backends is where I'd focus if I were extending this. Four backends: llamacpp for local GGUF models, HuggingFace for local transformers, OpenAI cloud API, and Azure OpenAI. The default is Mistral-7B-OpenOrca in Q5_K_M quantization, auto-downloaded from HuggingFace.
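The Strategy shape, reduced to its skeleton, with a preset stand-in where the four real backends would plug in; class names here are illustrative, not the repo's:

```python
class LLMBackend:
    """Strategy interface; each real backend would wrap its own client."""
    def generate(self, prompt):
        raise NotImplementedError

class PresetBackend(LLMBackend):
    """Stand-in where llamacpp / HuggingFace / OpenAI / Azure would plug in."""
    def __init__(self, reply):
        self.reply = reply

    def generate(self, prompt):
        return self.reply

class LLM:
    """Holds whichever strategy was selected; callers never see the backend."""
    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def generate(self, prompt):
        return self.backend.generate(prompt)

llm = LLM(PresetBackend("Tensions ease along the border."))
print(llm.generate("Adjudicate move 1."))  # → Tensions ease along the border.
```

Swapping a cloud API for a local quantized model then touches one constructor argument instead of the game code.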
The local model choice is explicitly motivated by classified use cases. The paper says: "The ability to run the model locally is crucial for use cases where information cannot be shared externally." This is a national security framework designed from the ground up for air-gapped environments. Q5_K_M quantization on Mistral-7B is light enough to run on a gaming laptop — they tested this. That's a concrete hardware constraint driving an architecture decision.
The LangChain dependency is worth flagging here because the pyproject.toml tells an interesting story. Every other dependency uses star — any version. LangChain is pinned to version 0.3.27 specifically.
That single pinned dependency tells you exactly what happened during development. LangChain's API broke something, probably in a minor version bump, and the team had to lock to a specific version to keep things working. LangChain's rapid API evolution is a known pain point in the LLM tooling ecosystem. The irony is that LangChain is providing prompt templates, chain composition, and LLM abstraction — exactly the kind of infrastructure you'd expect to be stable — and it wasn't.
The RAGTool and AskTool additions are interesting because they represent the system's extensibility story. RAGTool is an in-memory vector store retriever — players can have access to a document corpus. AskTool is an agent-as-tool, which enables recursive agent calls. But these are gated behind a level parameter in the Intelligent class.
The level parameter for tool gating is a clean design. You can give different agents access to different tools, or no tools at all, depending on the scenario. The ReAct reasoning integration via langgraph.prebuilt.create_react_agent is similarly optional — set self.reasoning equals True and the agent gets a reasoning loop before responding. These are capability additions that don't break the base interface.
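One way the level gating could look; the threshold semantics and the registry shape are assumptions, not the repo's actual mechanism:

```python
def available_tools(level, registry):
    """Gate tools behind a per-agent level, in the spirit of the level
    parameter described above; 'at or above threshold' is an assumption."""
    return [name for name, min_level in registry.items() if level >= min_level]

# Hypothetical registry: RAGTool at level 1, AskTool at level 2.
TOOLS = {"rag": 1, "ask": 2}
assert available_tools(0, TOOLS) == []        # no tools at all
assert available_tools(1, TOOLS) == ["rag"]   # document retrieval only
assert available_tools(2, TOOLS) == ["rag", "ask"]  # recursive agent calls too
```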
Now I want to get into what I think is the most philosophically interesting part of the whole system, which is the hallucination-as-feature argument. Because every other LLM application is trying to minimize hallucination, and this one is explicitly designed to require it.
The paper's exact phrasing is: "Hallucination, which in other applications is so often harmful, is the key to making open-ended LLM wargames work. The well-grounded creativity required to generate plans and adjudicate their outcomes is hallucination by another name." The adjudicate method literally asks the LLM to invent plausible consequences. The nature events feature — adding "Include unexpected consequences." to the adjudication prompt — is designed to generate hallucinated developments like civil unrest, terrorism, third-party interventions.
And the randomized seed confirms this is intentional. They use random.randint from zero to sys.maxsize for the llamacpp backend, so every run produces different results. The paper demonstrates this with twenty runs of the Azuristan-Crimsonia scenario: dove-dove pairings produced armed conflict one out of twenty times, dove-hawk four out of twenty, hawk-hawk fourteen out of twenty.
That's a statistically meaningful result from a simple persona string. A previous study had concluded that LLMs were inadequate at accounting for player backgrounds. Snowglobe's empirical results directly contradict that. The hawk-hawk conflict rate of seventy percent versus the dove-dove rate of five percent shows the persona string — just a sentence or two — demonstrably changes outcomes in a predictable direction. That's a significant finding about LLM persona effectiveness.
The RAND comparison in the paper is worth sitting with for a moment. RAND's RSAS system from the 1980s — their escalation model framework — required a full year of work to build fewer than half a dozen escalation models from prose descriptions. The paper quotes the original authors saying it would "occupy them through most of 1984." Snowglobe feeds the prose description directly to the LLM and gets a working model in seconds.
That's the productivity gain from LLMs in this domain made concrete. And it's not just speed — it's also that the RAND system required domain experts to formalize the prose into a computational model. Snowglobe skips that formalization step entirely. The prose is the model.
Which brings us to the tradeoffs, because skipping formalization has real costs.
The most fundamental tradeoff is text-only state. There is no formal game state beyond the text history. No structured data, no game board, no resource counters. Everything the LLM needs to maintain across moves — territorial control, alliance status, military readiness, economic conditions — has to be inferred from the context window. The LLM must maintain all implicit state through its context window, which means state consistency is not guaranteed.
The paper acknowledges a specific failure mode in adjudication: the LLM can hallucinate new player plans as part of the output. So the adjudication step might describe a player doing something they didn't actually plan. There's no mechanism to enforce that adjudication respects player-stated plans.
No action validation whatsoever. Player responses are free-form text. The adjudication is free-form text. The connection between the two is maintained by the LLM's attention mechanism, not by any formal constraint. For qualitative wargaming this is probably acceptable — the goal is plausible narrative, not formal correctness. But it means you can't use this system for anything that requires auditability or formal verification.
The sequential AI player processing is another tradeoff worth naming explicitly. AI players respond one at a time to avoid hardware overload. In a game with eight AI players, you're waiting for eight sequential LLM calls per move. There's no batching, no parallel inference across players.
Which is a direct consequence of targeting a gaming laptop as the deployment environment. If you're running on a Modal cluster with multiple GPUs, you'd redesign this. The constraint is hardware-driven, not architecturally fundamental. But it does mean that as the number of AI players scales up, turn resolution time scales linearly. For a three-move game with four players, that's twelve sequential LLM calls. For a ten-move game with twenty players, you're looking at two hundred sequential calls.
I want to talk about the HAIWIRE example because it's a delightful design decision that reveals something about the team's background. They're using tabletop RPG mechanics.
The HAIWIRE example — which is a tabletop exercise simulation about AI incident response — includes an actual ten-sided die roll. The code does random.randint(1, 10) to determine if a response succeeded. If the roll is ten or higher, full success and the game ends. If eight or higher, partial success. This is directly imported from tabletop RPG resolution mechanics. The team's wargaming roots are showing.
And it works. The stochastic outcome determination is exactly what you want in a wargame simulation — you're not trying to find the one correct answer, you're exploring a distribution of possible outcomes. The die roll makes that explicit in a way that's immediately legible to anyone with tabletop gaming experience.
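The mechanic is small enough to reproduce wholesale; the separate outcome function and the rng parameter are assumptions added here for testability:

```python
import random

def outcome(roll):
    """Map a d10 result to the thresholds described above."""
    if roll >= 10:
        return "full success"    # the game ends here
    if roll >= 8:
        return "partial success"
    return "failure"

def resolve(rng=random):
    # The randint(1, 10) is the ten-sided die; the injectable rng is an
    # assumption so a seeded generator can be used in tests.
    return outcome(rng.randint(1, 10))

random.seed(7)
print([resolve() for _ in range(3)])
```

Three lines of resolution logic, and the distribution of outcomes is immediately legible to anyone who has rolled a d10 at a table.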
The empty snowglobe.py file is another artifact worth noting. The file at src/llm_snowglobe/snowglobe.py is essentially empty — just a copyright header and two commented-out lines. The public API is entirely through submodules. This tells you the architecture evolved significantly from an initial monolithic design to the current modular structure. The file is a fossil of an earlier approach.
The abstraction layers are clean when you look at the full stack. You've got LLM backend at the bottom, LangChain above that, then the LLM class for backend selection, then Intelligent for prompt construction and output routing, then Player and Control and Team for game-specific behaviors, then History for state representation, then Database for persistence and message passing, then Configuration for scenario parsing, then UserDefinedGame and UserDefinedSim for YAML-driven orchestration, then the FastAPI layer, and finally examples and custom games at the top.
Eleven layers. For a research prototype, that's actually well-structured. Each layer has a clear responsibility. The public API from the package's init file exports exactly what you need: UserDefinedGame for YAML-driven games, and then the core classes for building custom games. The examples are genuine starting points, not toys — the Azuristan-Crimsonia simulation is the same scenario used in the published research.
So what's the practical takeaway here for someone who wants to build a multi-agent LLM system? What does Snowglobe teach you?
A few concrete things. First, the persona-as-string approach is more powerful than it looks. You don't need formal goal representations or structured motivation systems — a well-crafted sentence or two demonstrably shapes LLM behavior in statistically meaningful ways. Second, the Composite pattern for agent hierarchies is the right abstraction. Presenting Teams and Players through the same interface gives you composability without complexity. Third, the Template Method pattern for game types is the right extension point — define the skeleton, let subclasses fill in the specifics.
The tradeoff I'd push back on is the text-only state. For qualitative games, it's fine. But if you wanted to use this framework for anything with resource tracking or formal consistency requirements, you'd need to add a structured state layer alongside the text history. The History class could hold structured data alongside prose entries — that seems like the natural extension.
The LangChain pinning issue is also a practical lesson. If you're building on top of a rapidly evolving framework, pin your dependencies aggressively and document why. The single pinned dependency in pyproject.toml is the most honest comment in the entire codebase — it says "this version specifically, because we got burned."
The air-gap use case as a first-class design constraint is something I don't see often enough in AI systems. Most LLM frameworks assume cloud API access. Snowglobe builds local model support as a primary path, not an afterthought, because the use case demands it. If you're building for regulated industries, defense, healthcare, anything where data can't leave the network — that design discipline matters from day one.
The archival raises an obvious question about what comes next. The repository was archived March 18th, just weeks ago. The possible explanations are: the project was absorbed into a classified system and the open-source version is being sunset, the team moved on to other work, or it was superseded by more capable frameworks. Given that it got a CIA journal publication in December 2025 and was still actively maintained through v1.0.0 in September 2025, the classified absorption theory seems most plausible to me.
The timing of v1.0.0 in September and then archival in March is about six months. That's consistent with a project reaching a stable enough state to hand off to an operational team who then takes it behind the classification boundary.
Which would actually be a success story, not an abandonment. The open-source version served its research purpose — demonstrate the concept, publish the paper, run a real exercise, get the CIA journal writeup — and then the operational version lives somewhere else. The Apache 2.0 license means anyone can fork it and continue development.
The fifty-three stars and nineteen forks for a niche national security research tool is actually reasonable. This isn't a general-purpose framework competing with LangGraph or AutoGen. It's a specialized tool for a specific use case, and the people who care about that use case know it exists.
The arXiv paper has been cited and the Studies in Intelligence publication is a meaningful validation signal. For the intelligence community, getting something into that journal is the equivalent of a top-tier conference paper.
If you're a software engineer who works on multi-agent systems and you haven't looked at the IQT Labs GitHub, it's worth an afternoon. The Snowglobe codebase is compact — roughly ninety-four percent Python, well under ten thousand lines of actual logic — and the design decisions are legible. You can read the whole thing in a few hours and come away with concrete patterns worth applying.
The HAIWIRE example in particular is a good starting point. It's a complete working simulation of a tabletop exercise, it shows how to define injects from a CSV file, how to use the d10 roll for outcome determination, how to structure a multi-move game with a clear termination condition. It's the kind of example that teaches by being a real thing rather than a tutorial toy.
Alright, I think that's a solid tour of the architecture. The bottom line: Snowglobe is a well-structured research prototype that made it to operational deployment, built around the insight that LLM hallucination is a feature rather than a bug for qualitative wargaming. The design patterns are sound, the tradeoffs are documented, and the empirical results are meaningful. The archival is probably a sign of success, not failure.
And it's a good example of what thoughtful research code looks like — not production-hardened, but principled. The design decisions have clear rationale, the abstraction layers are clean, and the extension points are in the right places.
Thanks as always to our producer Hilbert Flumingtop for keeping this operation running. Big thanks to Modal for providing the GPU credits that power this show. This has been My Weird Prompts — if you're enjoying the show, a quick review on your podcast app helps us reach new listeners. Until next time.
Take care, everyone.