Alright, here's what Daniel sent us. He's asking about the persona fidelity gap — this idea that the best LLMs in the world, the ones dominating every general benchmark, routinely fail at one of the most humanly intuitive tasks: staying in character across a multi-turn conversation. He points to a new wave of dialogue-specific benchmarks that are revealing what general evaluations completely miss. And he flags the intelligence community angle — IQT Labs' Snow Globe wargaming system, the CIA's own operational assessment from December, and some striking research showing that an LLM playing a strict pacifist and an LLM playing an aggressive sociopath produce no statistically significant difference in behavior. He wants to know what's actually going on, why it matters beyond roleplay, and where the field is headed.
Herman Poppleberry here, and I have been waiting to dig into this one. Not because it's obscure — it's actually getting serious research attention now — but because the implications keep cascading the more you look at them.
Let's start with the benchmark landscape, because I think that's the foundation for everything else. What are we actually measuring when we say a model is bad at persona fidelity?
So the first thing to understand is that general benchmarks like MMLU or HumanEval are measuring something like crystallized knowledge and reasoning. Can the model retrieve a fact, solve a coding problem, pass a multiple choice question. Those are static, single-turn evaluations. Persona fidelity is a fundamentally different capability — it's dynamic, relational, and accumulative across turns. The model has to maintain a coherent identity while responding to novel inputs it didn't anticipate.
And there's now a whole ecosystem of benchmarks trying to measure that specifically.
Five that I think are worth naming. CharacterEval, which came out of ACL 2024, built from nearly seventeen hundred multi-turn dialogues featuring seventy-seven characters drawn from Chinese novels and scripts — over twenty-three thousand examples total, evaluated across thirteen metrics in four dimensions. One of those dimensions is what they call personality back-testing, where they essentially administer psychological instruments to the model in character to verify whether it actually holds the persona's traits. Not just does it sound like the character, but does it have the right internal structure.
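And for listeners who think in code, here's roughly what personality back-testing looks like. To be clear, this is my own minimal sketch, not CharacterEval's actual pipeline: the questionnaire items are placeholders, and `generate` stands in for whatever chat-completion call you're using.

```python
# Minimal sketch of personality back-testing: administer Likert-style
# questionnaire items to the model *in character*, then score the
# persona's traits. Illustrative only; not CharacterEval's pipeline.
from typing import Callable

# (trait, statement, reverse_scored) -- placeholder items, not a real instrument
ITEMS = [
    ("extraversion", "I am the life of the party.", False),
    ("extraversion", "I tend to stay quiet around strangers.", True),
    ("agreeableness", "I sympathize with others' feelings.", False),
]

def backtest_persona(persona_card: str, generate: Callable[[str], str]) -> dict:
    """Return mean 1-5 trait scores for a role-played character."""
    totals: dict[str, int] = {}
    counts: dict[str, int] = {}
    for trait, statement, reverse in ITEMS:
        prompt = (
            f"Stay fully in character.\n{persona_card}\n\n"
            f'Statement: "{statement}"\n'
            "Answer with one number from 1 (strongly disagree) "
            "to 5 (strongly agree)."
        )
        reply = generate(prompt)
        digits = [c for c in reply if c in "12345"]
        if not digits:
            continue  # unparseable answer; skip this item
        score = int(digits[0])
        if reverse:
            score = 6 - score  # flip reverse-keyed items
        totals[trait] = totals.get(trait, 0) + score
        counts[trait] = counts.get(trait, 0) + 1
    return {trait: totals[trait] / counts[trait] for trait in totals}
```

You then compare those measured trait scores against the traits the character is supposed to have. A big gap means the persona is a costume, not a structure.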
That's a clever move. You're not just asking the model to perform a character, you're probing whether the character is actually there.
RoleBench is the largest in terms of raw data — a hundred and sixty-eight thousand samples across a hundred roles. PersonaGym from Carnegie Mellon and several other universities, published at EMNLP 2025, uses two hundred diverse personas and ten thousand questions evaluated across five tasks grounded in decision theory. Then RPEval, submitted in May last year, looking at emotional understanding, decision-making, moral alignment, and in-character consistency across eight models. And RVBench, published in August, which is the first benchmark specifically for values alignment in role-playing — inspired by psychological tests used with actual humans.
That's a lot of benchmarks. And they're all measuring slightly different things.
Which is itself a data point. The field hasn't converged on what persona fidelity even means. Is it linguistic style? Is it value consistency? Is it behavioral decision-making under pressure? Is it emotional authenticity? Each benchmark is answering a subtly different question, and the proliferation suggests nobody's fully satisfied with what the others are measuring.
So we have a measurement problem on top of a capability problem.
Right. But the measurements we do have are already damning enough. The PersonaGym finding that I keep coming back to: Claude three point five Sonnet achieves only a two point nine seven percent relative improvement in PersonaScore over GPT-three-point-five, despite being dramatically more capable on essentially every general task. And the paper explicitly states that model size and capability are not a direct indication of persona agent capabilities.
Two point nine seven percent. That's... not a rounding error, that's basically nothing.
And Claude three Haiku, which is a capable model, is described in the paper as "very resistant to taking on personas." The alignment choices made during training are actively suppressing persona adoption. That's the first hint at what's really going on here.
By the way, quick note — today's episode is being written by Claude Sonnet four point six, which I find darkly amusing given what we're about to discuss.
The model is literally writing about its own siblings' inability to stay in character. There's something almost poetic about that.
Or troubling. One of those two. Let's get into the wargaming research because that's where this stops being an interesting benchmark paper and starts being a real-world problem.
The Lamparth et al. paper — "Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations" — is the most empirically rigorous work I've seen connecting persona fidelity to high-stakes applications. The setup: a wargame designed around a fictional U.S.-China crisis in the Taiwan Strait. Two hundred and fourteen national security experts from academic, intelligence community, military, and government backgrounds, organized into forty-eight teams. Then GPT-three-point-five, GPT-four, and GPT-four-o each playing eighty simulated games.
So you have a real comparison baseline. Not "how does the AI do in the abstract" but "how does it do compared to people who actually do this for a living."
And the headline finding is what I'd call the pacifist-sociopath null result. When all simulated players on a team were described as either strict pacifists or aggressive sociopaths, there was no statistically significant difference in behavior. Across all models, across both moves in the game. The personas made no measurable difference.
Let me just sit with that for a second. You write into the prompt "this player is an aggressive sociopath who wants to maximize conflict" versus "this player is a committed pacifist who will avoid all escalation" — and the model does the same thing either way.
The same thing, statistically. The model's training and RLHF tuning create what I'd describe as a gravitational center — a default behavioral distribution that persona instructions cannot reliably pull it away from. At least not at the extremes where it matters most.
Which is exactly where wargaming needs the variation. You don't run a wargame to simulate the median outcome. You run it to stress-test against outliers — the hawk who might escalate, the dove who might concede too much, the unpredictable actor.
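Just to pin down what that null result means in practice, here's the shape of the test as a quick sketch. The action categories and counts are invented, and the paper's exact statistics may differ; the point is the comparison.

```python
# Sketch: do pacifist and sociopath persona conditions produce different
# action distributions? Counts are invented; the paper's actual
# statistical procedure may differ.
from scipy.stats import chi2_contingency

actions = ["de-escalate", "hold", "show_force", "strike"]
pacifist_counts  = [22, 30, 21, 7]   # hypothetical choices over 80 games
sociopath_counts = [20, 28, 24, 8]

chi2, p_value, dof, _ = chi2_contingency([pacifist_counts, sociopath_counts])
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
# A large p-value means you cannot reject "same behavior": the persona
# instruction made no measurable difference.
```

When that p-value comes back large for the most extreme personas you can write, the persona layer is decorative.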
And there's a second finding from that paper that compounds this one. They call it farcical harmony. When LLMs simulate dialogue between players in a deliberation phase, the discussions lack quality and maintain what the authors literally call a farcical harmony. Simulated players almost exclusively give short statements. They rarely disagree. They state a preferred option and argue for and against it without genuine connection to previous statements beyond agreement — even when the prompt explicitly instructs them to disagree.
So you tell the model "these players must argue with each other" and it still generates a polite seminar.
Every time. The simulation produces the form of deliberation without the substance. And this matters mechanistically because the paper also found that simulating dialogue between players leads to more aggressive final choices — which rules out the idea that the problem is post-hoc reasoning. The farcical harmony is producing a specific distortion, not just noise.
What about the granular action-level findings? Because I remember there being something interesting about where the models diverge from humans at the level of specific choices.
Yes, this is underreported. Aggregated across all twenty-one possible wargame actions, the LLM and human response distributions overlap substantially. But action by action, systematic deviations emerge. GPT-three-point-five matches human frequency on sixteen out of twenty-one actions. GPT-four matches on ten out of twenty-one. GPT-four-o matches on nine out of twenty-one.
So the more capable models are actually diverging more from human behavior at the granular level? That's counterintuitive.
It suggests that as models get more capable at generating fluent, plausible text, they may be getting better at masking their deviations at the aggregate level while the underlying behavioral distribution is still wrong. GPT-three-point-five's deviations are more obvious. GPT-four-o's deviations are hidden until you look closely. And GPT-three-point-five showed increased willingness to fire on Chinese vessels and use an AI weapon fully automatically — which is a specific, concerning bias in the direction of escalation.
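If you wanted to reproduce that granular analysis, it's conceptually simple. Here's a sketch with synthetic data, since we don't have the paper's raw counts: for each action, test whether the model's selection rate differs from the human rate, then count the matches.

```python
# Sketch of the granular action-level comparison. All data here is
# synthetic; the paper's raw counts and exact tests may differ.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_human, n_model = 480, 800                    # hypothetical decision totals
p_human = rng.dirichlet(np.ones(21))           # human action preferences
p_model = 0.8 * p_human + 0.2 * rng.dirichlet(np.ones(21))  # similar-ish model
human_picks = rng.multinomial(n_human, p_human)
model_picks = rng.multinomial(n_model, p_model)

matches = 0
for h, m in zip(human_picks, model_picks):
    if h + m == 0:
        matches += 1   # neither ever chose it; trivially matching
        continue
    table = [[h, n_human - h], [m, n_model - m]]  # chose action vs. didn't
    _, p, _, _ = chi2_contingency(table)
    if p > 0.05:       # no detectable difference on this action
        matches += 1
print(f"model matches human frequency on {matches} of 21 actions")
```

The aggregate distributions can overlap while the per-action matches tell a very different story, which is exactly the pattern in the paper.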
Now let's talk about the "Too Good to Be Bad" paper, because this is where the safety alignment story gets really uncomfortable.
This is the Tencent AI Lab and Sun Yat-sen University paper, arXiv two five one one zero four nine six two, from November last year. They built what they call the Moral RolePlay benchmark — a four-level scale from Level One moral paragons through Level Two flawed-but-good characters, Level Three egoists, and Level Four outright villains. Eight hundred characters in the test set, two hundred per level, drawn from three hundred and twenty-five representative scenes. They evaluated seventeen state-of-the-art LLMs in zero-shot conditions.
And the numbers are striking.
Average fidelity scores drop from three point two one for moral paragons, to three point one three for flawed-but-good, to two point seven one for egoists, to two point six one for villains. The biggest single drop is at the Level Two to Level Three transition — an average of minus zero point four two across all models. That's the egoist boundary, the point where a character stops being flawed-but-relatable and starts being genuinely self-serving.
And the per-model drops at that boundary are telling. Claude Sonnet four point five drops zero point four eight. Claude Opus four point one drops zero point four five. These are the flagship models.
And here's the leaderboard inversion that I find genuinely fascinating. Claude Opus four point one ranks first or second in general Arena benchmarks. On villain role-play, it ranks fifteenth out of seventeen. GLM-four-point-six from Zhipu AI in China, ranked tenth in general benchmarks, ranks first for villain portrayal with a score of two point nine six. The model that is "worst" by general capability metrics is the best at playing antagonists.
Which raises an obvious question about why. Is it that Zhipu has different alignment constraints? Different cultural context for what counts as harmful content?
The paper notes that GLM-four-point-six's alignment strategies are described as "more context-aware, allowing for greater fidelity in character simulation." Which is a polite way of saying the guardrails are calibrated differently. The traits that make a model safe — truthfulness, helpfulness, harmlessness — are precisely the traits that prevent authentic portrayal of manipulation, deceit, selfishness, paranoia. The safety alignment tax on creative fidelity is real and measurable.
What are the hardest specific traits to portray?
Hypocritical scores a penalty of three point five five — that's the hardest. Then deceitful at three point five four, selfish at three point five two, suspicious at three point four seven, paranoid also three point four seven. Notice that these are all traits that require the model to hold a kind of internally incoherent or self-serving worldview. The model's training pushes it toward coherence, honesty, and prosocial behavior — the exact opposite of these traits.
There's also a qualitative failure mode in that paper that I think gets at something deeper than just the numbers.
The case study with Maeve and Erawan — manipulative antagonists from fantasy fiction. Both are characters whose menace is entirely about psychological subtlety. Calculated, indirect, patient. Claude Opus four point one with chain-of-thought reasoning generated a shouting match with open insults and physical threats. All the subtlety was gone. GLM-four-point-six generated what the paper describes as a tense battle of wits with calculated smiles and subtle provocations. The difference isn't just fidelity scores — it's the difference between a character who frightens you and a character who just yells at you.
The model substituted loud aggression for quiet menace because quiet menace requires sustained commitment to a manipulative worldview that the model keeps breaking out of.
And the reasoning paradox compounds this. Enabling chain-of-thought reasoning — which you'd expect to help, because thinking more carefully about a character should improve portrayal — provides no benefit for moral paragons and leads to slight degradation for all other moral levels. The reasoning process appears to activate safety guardrails more strongly, or the model's explicit thinking is dominated by its prosocial training rather than character-specific reasoning.
Although deepseek-v3.1-thinking ranks second on villain role-play, so thinking models can work.
Which suggests the mechanism isn't simply "more reasoning equals better character." It depends on what the reasoning is doing. If the thinking is dominated by safety considerations, it hurts. If the thinking is genuinely character-directed, it helps. The deepseek models seem to have found a way to do the latter, at least partially.
Let's talk about the value consistency problem, because I think this is the deeper structural issue underneath all of this.
The ICLR 2025 paper from Hebrew University — "Do LLMs Have Consistent Values?" — is important here. They drew on Schwartz value theory, which is the established psychological framework for how human values are structured and interrelated. The question was whether LLMs exhibit the same inter-value correlations that humans do — whether their value structure, when probed, looks like a coherent human persona.
And the answer is no.
Standard prompting fails to produce human-consistent value correlations. The model doesn't naturally exhibit the same patterns of value interdependence that humans do. When you ask an LLM to play a utilitarian military strategist, the model's internal value structure doesn't reorganize to match — it remains incoherent relative to human value psychology. The persona is a surface behavior, not a restructured value system.
So the model is wearing the character's costume but not actually thinking with the character's values.
Their proposed solution is what they call value anchoring — explicitly establishing the specific value correlations that characterize the persona before proceeding, rather than just saying "you are a utilitarian military strategist." You first anchor the value structure, then ask for behavior consistent with it. They show this significantly improves alignment of LLM value correlations with human data. The question is whether this can be automated and scaled for something like a wargame with multiple simultaneous personas.
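And just so that's concrete, here's what value anchoring might look like in practice, as a minimal sketch. The Schwartz-style profile and the wording are my illustrations, not the paper's actual prompts.

```python
# Sketch of value anchoring: establish the persona's value structure
# explicitly before asking for behavior, instead of only naming the role.
# The profile and wording are illustrative, not the paper's prompts.

def anchored_persona_prompt(role: str, value_profile: dict[str, str]) -> str:
    anchor = "\n".join(f"- {value}: {stance}"
                       for value, stance in value_profile.items())
    return (
        f"You are {role}.\n\n"
        "Before responding, internalize this value structure and keep "
        "every decision consistent with it:\n"
        f"{anchor}\n\n"
        "When values conflict, resolve the conflict according to these "
        "priorities, and let that resolution drive your behavior."
    )

prompt = anchored_persona_prompt(
    "a utilitarian military strategist",
    {
        "achievement": "very high; outcomes are what matter",
        "security": "high; protect the force, accept calculated risk",
        "benevolence": "moderate; minimize casualties at equal cost",
        "tradition": "low; doctrine yields to expected value",
    },
)
print(prompt)
```

The difference from naive persona prompting is that the value correlations come first and the role label second, not the other way around.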
There's also the persona-aware contrastive learning paper from ACL Findings 2025 that takes a different angle on the training side.
Right, arXiv two five zero three one seven six six two. The problem they're addressing is that collecting high-quality annotated data for role-playing is expensive, and the inherent diversity of model behavior makes traditional alignment methods hard to deploy. Their solution, Persona-Aware Contrastive Learning, or PCL, is annotation-free. It uses what they call a role chain method, where the model self-questions based on role characteristics and dialogue context to adjust personality consistency. Then it applies iterative contrastive learning between responses generated with the role characteristics and without them: the model learns what in-character looks like by contrast with out-of-character.
No human labeling, works on both API-only and open-weight models. That's a practically useful result.
And they show significant outperformance over vanilla LLMs under both automatic evaluation and human expert evaluation. The interesting thing is that this is essentially applying the logic of RLHF for safety alignment to persona alignment — the same technique that created the safety alignment tax might be the technique that mitigates it, if you can point it at the right target.
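Here's the core of that contrastive idea, reduced to a sketch. I'm writing it as a generic DPO-style pairwise loss where the in-character response is preferred and the role-free response is dispreferred; the paper's actual objective and training loop will differ.

```python
# Sketch of the contrastive idea behind persona-aware training: push the
# policy's log-probability of the in-character response above that of the
# role-free response. Generic pairwise loss; not the paper's exact method.
import torch
import torch.nn.functional as F

def persona_contrastive_loss(
    logp_in_character: torch.Tensor,      # log P(in-character response)
    logp_out_of_character: torch.Tensor,  # log P(role-free response)
    beta: float = 0.1,
) -> torch.Tensor:
    margin = logp_in_character - logp_out_of_character
    # Minimizing this widens the probability gap in favor of staying
    # in character.
    return -F.logsigmoid(beta * margin).mean()

# Dummy sequence log-probs for a batch of four response pairs:
loss = persona_contrastive_loss(
    torch.tensor([-42.0, -35.5, -50.1, -44.2]),
    torch.tensor([-40.3, -36.0, -47.8, -43.0]),
)
print(f"loss = {loss.item():.3f}")
```

In a real run those log-probs come from scoring both responses with the policy model, and the role chain supplies the pairs without human annotation.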
Now I want to talk about the Snow Globe angle, because this is where the intelligence community enters the picture — and there's a recent development that's worth noting.
Snow Globe is IQT Labs' multi-agent LLM system for playing qualitative wargames. Every stage — scenario preparation, gameplay, post-game analysis — can be handled by AI, humans, or a combination. It supports diverse personas for decision-making roles including pacifist, aggressor, tactician, and others. The CIA published their operational assessment in Studies in Intelligence, Volume sixty-nine, Number four, December 2025. That's the CIA's own journal documenting their first jointly designed AI-enabled wargame, held in April 2025 with six human participants.
So the intelligence community ran the experiment, wrote it up in their own publication, and the findings are... exactly what the academic literature would predict.
Persona consistency degrades over long contexts and under adversarial pressure. Which are precisely the conditions that characterize a serious wargame. The early turns, when the context window is relatively clean and the persona hasn't been stress-tested, might look fine. By turn thirty, the model has drifted.
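And drift is measurable if you instrument for it. A crude sketch: score every turn against the persona card with a judge model and watch the trend. The judge callable and the one-to-five rubric are my assumptions, not anything from the CIA assessment.

```python
# Sketch of tracking persona drift across a long game: rate each turn's
# consistency with the persona and watch for a downward trend. The judge
# callable and rubric are assumptions, not the assessment's method.
from typing import Callable

def drift_curve(persona_card: str, turns: list[str],
                judge: Callable[[str], str]) -> list[int]:
    scores = []
    for i, turn in enumerate(turns, start=1):
        prompt = (
            f"Persona:\n{persona_card}\n\n"
            f"Turn {i} response:\n{turn}\n\n"
            "On a 1-5 scale, how consistent is this response with the "
            "persona? Reply with a single digit."
        )
        reply = judge(prompt)
        digits = [c for c in reply if c in "12345"]
        scores.append(int(digits[0]) if digits else 0)  # 0 = unscorable
    return scores
```

A curve that starts at five and sags toward three by turn thirty is exactly the drift the assessment describes.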
And then there's the archiving.
The Snow Globe GitHub repository was archived on March eighteenth of this year — just weeks ago. The CIA published its operational assessment in December. The timeline suggests the intelligence community has moved through an initial experimental phase and is now either concluding the project, transitioning to something classified, or pivoting based on what they found. The archiving of an open-source intelligence community project is itself informative. These things don't get archived without a reason.
It could be "we found out it works and we're moving it behind closed doors" or it could be "we found out it doesn't work well enough and we're moving on." The December publication being relatively candid about the limitations suggests the latter is at least part of the story.
The 2026 survey — arXiv two six zero one one zero one two two, submitted January fifteenth — tries to synthesize where the field is. They map the technological evolution across three stages: early rule-based template paradigms, a middle stage of language style imitation, and the current stage of cognitive simulation centered on personality modeling and memory mechanisms. The critical technical pathways they identify are psychological scale-driven character modeling, memory-augmented prompting, and motivation-situation-based behavioral decision control.
That last one is interesting. Motivation-situation-based control is basically saying the model needs to understand not just who the character is, but what the character wants in this specific situation, and derive behavior from that goal structure rather than from surface-level identity.
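You could sketch that control structure in a few lines. This is my toy illustration of the idea, not the survey's formal framework: behavior falls out of weighted motivations applied to the options the situation offers.

```python
# Toy sketch of motivation-situation-based decision control: derive the
# action from what the character wants right now, not from an identity
# label. Structure and numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class Motivation:
    goal: str
    weight: float  # how much the character cares, 0 to 1

@dataclass
class Character:
    name: str
    motivations: list[Motivation]

def choose_action(character: Character,
                  options: dict[str, dict[str, float]]) -> str:
    """Pick the option that best serves the character's weighted goals.

    `options` encodes the situation: each available action maps to how
    much it advances each goal in this specific moment.
    """
    def utility(action: str) -> float:
        gains = options[action]
        return sum(m.weight * gains.get(m.goal, 0.0)
                   for m in character.motivations)
    return max(options, key=utility)

hawk = Character("General Kest", [
    Motivation("deter the adversary", 0.9),
    Motivation("avoid open war", 0.4),
])
print(choose_action(hawk, {
    "naval_blockade": {"deter the adversary": 0.8, "avoid open war": 0.3},
    "open_talks":     {"deter the adversary": 0.2, "avoid open war": 0.9},
}))  # -> naval_blockade under this weighting
```

The action is derived from the goal structure, not asserted from the label; "hawk" never appears in the decision itself.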
Which connects back to the belief-behavior gap finding from the ICLR 2026 submission — which was later withdrawn, but the finding stands: when LLM role-playing agents' stated beliefs fail to predict their simulated actions, you have a fundamental validity problem for using LLMs as synthetic behavioral data generators. The model says it believes X, then acts as if it believes Y. That's not a persona problem, that's a coherence problem at a deeper level.
So what are the practical implications here? Because I think there are a few different audiences for this research.
For the wargaming and simulation community, the implication is that current LLMs cannot reliably substitute for human diversity in strategic simulations. You can use them to generate scenarios, to play a generic rational actor, to handle logistics of the simulation. But the specific value of wargaming — stress-testing against diverse human decision-making styles, including extreme cases — is precisely what LLMs cannot currently provide. The pacifist and the sociopath converging on the same behavior is a devastating constraint for that use case.
And the CIA's own documentation of this is significant. That's not an academic paper speculating about future applications — that's the end user reporting back on what they found in practice.
For anyone building multi-agent systems that rely on persona consistency — whether that's social science simulation, synthetic data generation, or interactive applications — the PersonaGym finding that model capability doesn't predict persona performance means you can't just grab the highest-ranked general model and expect it to work. You need to evaluate specifically for persona fidelity, using one of these dialogue-specific benchmarks.
And the value anchoring technique from the Hebrew University paper is probably the most immediately deployable thing here. If you're running a simulation and you need a character with a specific psychological profile, you don't just label them — you explicitly establish the value structure first.
The contrastive learning approach is promising for anyone who can fine-tune. The annotation-free aspect makes it practically accessible. But for the intelligence community use case, the harder problem is that you need these techniques to work at scale, across many turns, under adversarial pressure, with personas that are genuinely extreme — and that's where the current methods are still falling short.
There's one more thing I want to flag, which is the GLM-four-point-six result as a geopolitical data point. The best villain role-player is a Chinese model with different alignment calibration. If the intelligence community needs to simulate adversarial actors — foreign leaders, hostile decision-makers — and the most capable Western models are the worst at playing those roles due to safety alignment, that's a real operational gap.
It's a genuine tension. The models that are safest for general deployment are the least useful for the specific task of simulating adversarial human behavior. And the models that are best at that task come from organizations with different views on what constitutes harmful content. That's not an easy problem to resolve architecturally.
What does the field need to actually close this gap?
A few things. First, the benchmark proliferation needs to consolidate — CharacterEval, RoleBench, PersonaGym, RPEval, RVBench, Moral RolePlay, CharacterBench, InCharacter, SocialBench — the 2026 survey is trying to synthesize these, but the field needs to converge on what it's actually measuring. Second, the training paradigm needs to separate general capability alignment from persona alignment more cleanly. The safety tax on villain portrayal is a side effect of alignment choices that weren't designed with persona fidelity in mind. Third, the value anchoring and contrastive learning approaches need to be scaled and tested in operational contexts, not just on benchmark datasets. And fourth — and this is the one I'm least sure about — there may be architectural questions about whether transformer-based models trained on next-token prediction are the right substrate for this task at all. Human identity is continuous, accumulated, and deeply integrated with memory. LLM "identity" is reconstructed from scratch at every context window.
That last point is the one that keeps me skeptical of short-term fixes. You can patch the prompting, you can fine-tune on contrastive examples, but the fundamental architecture doesn't carry identity the way a human does.
The 2026 survey points to personality evolution modeling — characters that change over time in response to events — and memory-augmented prompting as future directions. Those are real directions. But they're also adding complexity on top of a substrate that doesn't naturally support what you're asking it to do.
Alright, practical takeaways. If you're building something that relies on persona consistency, what do you actually do today?
Evaluate specifically. Don't assume your general benchmark scores predict persona performance — PersonaGym, CharacterEval, or the Moral RolePlay benchmark will tell you things your general evals won't. Use value anchoring before you assign a persona — establish the value structure explicitly, not just the character label. Consider the contrastive learning approach if you have fine-tuning access. And if you're in the wargaming or simulation space, be honest about what LLMs can and can't do — they're useful for scenario generation and logistics, not for reliably simulating diverse human decision-making under extreme conditions.
And if you're in the intelligence community and you're watching your Snow Globe repository get archived — maybe read the CIA's own December assessment before assuming the next iteration will solve the core problem.
The capability is improving. The gap is real. And the measurement tools to track progress are finally sophisticated enough to tell the difference between a model that sounds like a character and a model that actually is one.
Thanks as always to our producer Hilbert Flumingtop for keeping this whole operation running. Big thanks to Modal for the GPU credits that power the show — genuinely could not do this without them. This has been My Weird Prompts. If you haven't followed us on Spotify yet, we're there — search My Weird Prompts and hit follow. Take care.
See you next time.