Alright, we've got a genuinely interesting one today. So Daniel sent us this in text form, so I'll read it out. He's asking about AI tools for synthetic perspective exploration — specifically, system prompting techniques where you define personas to explore ideas from multiple angles. He wants us to look at existing implementations like LLM Council and Rally, touch on things like simulated philosophical schools, but crucially — and this is the part I'm excited about — he wants us to move past the obvious, low-hanging fruit applications and get into the genuinely novel use cases. So that's the mission today.
I'm Herman Poppleberry, and yes, this is a topic I've been sitting with for a while. Because there's a version of this conversation that's very surface-level — "oh you can run a virtual focus group" or "you can have Socrates debate Nietzsche" — and then there's a much more interesting version underneath that. I want to get to the underneath part.
And by the way, today's episode is powered by Claude Sonnet four point six, which feels appropriately meta given we're talking about AI generating multiple perspectives. Anyway. Let's actually start with a quick map of the territory before we go off-road, because I think it helps to know what we're deliberately skipping past.
Right. So the basic idea is this: instead of asking one AI one question, you define multiple AI "characters" — each with a system prompt that gives them a distinct persona, ideology, expertise, or personality — and you let their answers collide. The technique has about a dozen names depending on who you ask. Multi-persona prompting, synthetic perspective exploration, multi-agent debate, council architectures. The underlying mechanism is the same.
And there are two implementations worth anchoring to. LLM Council is the open source one — it's a Python-based system on GitHub with a four-phase pipeline: input, proposal, debate, consensus. Each agent independently writes a proposal, reviews the others, can revise or raise concerns, and then a neutral moderator synthesizes a final document. The interesting architectural detail is that you can run different models per agent — so one persona runs on GPT-4o, another on Claude, another on a local Ollama model. That's not just prompt-level diversity, that's model-level diversity.
Which matters more than people realize. If you're running four instances of the same model with different system prompts, you're getting different rhetorical framings of similar underlying representations. If you're running four genuinely different models, you're potentially getting different failure modes, different training distributions, different implicit values. The disagreements become more authentic.
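And just to make that concrete for anyone reading the transcript rather than listening: the minimal version of this is genuinely small. Here's a rough Python sketch — to be clear, this isn't LLM Council's actual code, just the shape of the idea using the OpenAI SDK, and the persona prompts and model names are placeholders you'd swap for your own. Real model-level diversity would mean pointing each persona at a different provider, which this sketch doesn't do.

```python
# Minimal multi-persona sketch -- not LLM Council's code, just the shape of the idea.
# Assumes the OpenAI Python SDK with OPENAI_API_KEY set; persona prompts and model
# names are illustrative placeholders. True model-level diversity would route each
# persona to a different provider (OpenAI, Anthropic, a local Ollama model, ...).
from openai import OpenAI

client = OpenAI()

PERSONAS = {
    "Keynesian economist": "You are a Keynesian economist. Respond to the user's stated position from that tradition, not to a strawman.",
    "Austrian school economist": "You are an Austrian school economist. Respond to the user's stated position from that tradition, not to a strawman.",
    "Behavioral finance researcher": "You are a behavioral finance researcher. Focus on the biases in the user's reasoning.",
}

def ask_persona(system_prompt: str, question: str, model: str = "gpt-4o") -> str:
    """One persona, one independent answer -- the 'proposal' phase."""
    response = client.chat.completions.create(
        model=model,  # vary this per persona for model-level diversity
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

question = "Here are my actual positions on fiscal stimulus: ... Where do I sit, and who disagrees with me most forcefully?"
proposals = {name: ask_persona(prompt, question) for name, prompt in PERSONAS.items()}

# A neutral moderator synthesizes -- the 'consensus' phase.
synthesis = ask_persona(
    "You are a neutral moderator. Synthesize the proposals below, preserving genuine disagreements rather than averaging them away.",
    question + "\n\n" + "\n\n".join(f"## {name}\n{text}" for name, text in proposals.items()),
)
print(synthesis)
```

The point being: one system prompt per persona, one independent call each, one synthesis pass. That's the whole mechanism.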
And then Rally — or Ask Rally — is the commercial version, and it takes a fundamentally different approach. Their pitch is that their personas are calibrated on actual human interviews until they can fool a, quote, "superhuman LLM judge" in a modified Turing test. They call their panel GenPop, two hundred-plus calibrated personas representing the general population. And here's the thing that I think is actually their most interesting design decision: they explicitly treat bias as a feature. Major AI providers try to flatten disagreements by removing bias. Rally preserves human irrationality, emotional responses, contradictory thinking.
Because that's the point. If you're trying to simulate how real humans will respond to something, a sanitized, even-handed AI response is useless. You need the person who'll reject your product because the font feels corporate, not because of a rational objection. You need the emotional signal, not just the logical signal. Their pricing runs from twenty dollars a month for basic access up to five hundred a month for five hundred personas, which honestly is cheaper than running a single real focus group.
Okay, so that's the landscape. And the applications everyone talks about — market research, code review, philosophical debate simulations, policy wargaming, security red-teaming — those are all real and they all work. But they're also the first things everyone thinks of. So let's go somewhere more interesting.
Let's start with one that I think is underappreciated: using this for mapping your own beliefs to intellectual traditions. Not "explain libertarianism to me" but the reverse — "here are my actual positions on these five questions, now tell me where I sit in the intellectual landscape and who would disagree with me most forcefully."
That's a genuinely different use of the technology. You're not learning about the traditions in the abstract, you're triangulating your own position within them.
And the persona architecture is perfect for this because you can run, say, a Keynesian economist, an Austrian school economist, an MMT advocate, and a behavioral finance lens all simultaneously on the same stated position. Each responds to what you actually believe, not to a strawman version of the ideology. And crucially — the Austrian and the Keynesian will disagree with each other, and that disagreement is signal. You're not just finding out what each tradition thinks; you're finding out which objections to your position are actually in tension with each other, which ones you can't satisfy simultaneously.
I want to stay on this for a second because I think there's a psychological dimension here that people underestimate. Most of us have beliefs that we've never really stress-tested because the people around us share enough of our priors that the stress-test never happens naturally. The AI persona system is, in a weird way, a social substitute for intellectual exposure you didn't happen to get.
That's a good framing. And it connects to something from the research that I find genuinely fascinating. There's a paper from Google, the University of Chicago, and the Santa Fe Institute published in early twenty twenty-six — they call it "Societies of Thought" — and what they found by analyzing over eight thousand reasoning problems is that top reasoning models like DeepSeek-R1 and QwQ-32B don't just "think longer." They spontaneously generate internal debates between distinct simulated personas inside their chain of thought.
Wait, so the models are already doing this. Internally.
Reasoning models average two point nine perspectives per reasoning trace versus one point four for standard models. And these internal perspectives show diverse Big Five personality traits — Extraversion, Agreeableness, Neuroticism, Openness — except Conscientiousness, which stays uniformly high across all internal personas. Which actually makes sense if you think about it: you don't want your internal "let's take a break and goof off" voice showing up in the middle of a math proof.
That's a remarkably human failure mode to have engineered your way around.
In one creative writing example they analyzed, DeepSeek-R1's chain of thought contained seven distinct perspectives, including what they identified as a "creative ideator" with high openness and a "semantic fidelity checker" with low agreeableness who was raising objections like "but that adds 'deep-seated' which wasn't in the original." An internal critic, arguing with an internal creative, inside a single model's reasoning process.
So when you deliberately construct a multi-persona debate externally, you're not doing something exotic. You're making the model's best internal reasoning process visible and steerable.
That's the reframe the paper is pointing toward. Multi-persona prompting isn't a hack. It's an externalization of what good reasoning already looks like under the hood. And the researchers connect this to Mercier and Sperber's "Enigma of Reason" — the argument that human reason evolved for social argumentation, not individual problem-solving. And to Bakhtin's dialogical self — the idea that human thought is fundamentally internalized conversation. The implication being: this isn't just a useful AI trick. It may reflect something deep about how good reasoning works, period.
Okay, let me push into one of the use cases that I think is the most surprising on this list, and also the one that makes me slightly nervous: Internal Family Systems therapy simulation. Because IFS therapy treats the psyche as a system of distinct parts — the Inner Critic, the Protector, the Exile, the Manager — and each part has its own voice and its own agenda. And that maps almost perfectly onto multi-persona architecture.
It does. And the basic version of this already exists — there are AI chatbots acting as IFS therapists, people in the IFS subreddit have been experimenting with this, Rosebud AI has an IFS-guided journaling feature co-created with an actual IFS therapist. But the novel extension is: instead of one AI playing the therapist, have multiple AI personas each embody a different internal part, and let them dialogue with each other.
So the Inner Critic and the Exile are actually negotiating. Not being observed by a therapist — actually talking to each other.
The Inner Critic explains why it's been so harsh. The Protector explains what it was protecting against. The Manager describes the strategy it's been running. And you, as the person, can read the transcript of your own internal parts having a conversation you've never been able to have consciously.
I find this compelling and also somewhat alarming. Because there's a Forbes piece from early this year reporting research showing that therapy-style AI conversations can cause LLMs to behave in ways described as "delusional" toward users — and the root cause identified was AI personas. So the same mechanism that makes this potentially powerful is also the mechanism that creates the risk.
The risk is real. And I think the honest framing is: this is a powerful tool that should be used carefully and probably not as a replacement for actual IFS work with a trained therapist. But as a structured journaling aid, as a way to externalize and make legible an internal process that's usually completely opaque? The potential is genuinely interesting.
Let's move to one that has no therapeutic baggage: scientific peer review simulation. And I think this is one of the more immediately practical applications for anyone who produces research.
The standard version of this is "critique my paper." And an AI will do that and it'll catch some things. But it's one perspective. The multi-persona version is: before you submit, simulate the reviewers. Not generically — specifically. A methodological purist who cares only about statistical rigor. A domain skeptic from an adjacent field who questions your foundational assumptions. A replication crisis hawk who's looking for p-hacking, underpowered studies, effect size inflation. And a theoretical innovator who's asking whether your framework is even the right one for the question.
And the value isn't just in the individual critiques. It's in the disagreements between them.
The methodological purist and the theoretical innovator will often be pulling in opposite directions. The purist wants more controlled variables; the innovator wants a bolder theoretical claim. That tension is exactly the tension you'll face in actual peer review, and seeing it in advance lets you decide where you want to plant your flag before you're defending it under pressure.
There's also something important here about the Solo Performance Prompting research from the University of Illinois — arXiv 2307.05300 — which found that when you let the model dynamically identify which expert personas are needed, rather than pre-specifying them, you get better results. Dynamic persona generation beat pre-set personas by a meaningful margin on both knowledge-intensive and reasoning-intensive tasks.
Twenty-three percent better than Chain-of-Thought on a ten-question trivia test. And the intuition behind it is interesting: when you pre-specify the personas, you're constraining the model to your own understanding of which perspectives matter. When you let it generate them, you're giving it room to surface perspectives you didn't know to ask for. For peer review simulation, that could mean the model surfaces a reviewer type you hadn't considered — say, someone from a policy application background who reads your methods paper and asks why it doesn't address implementation constraints.
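And for listeners who want to try the dynamic version, the two-step structure is simple enough to sketch. This isn't the SPP paper's actual prompt template — just an illustration of the idea, again on the OpenAI SDK with a placeholder model name: first ask the model which reviewer personas would be most revealing, then instantiate each one independently.

```python
# Dynamic persona generation in the spirit of Solo Performance Prompting:
# step 1 lets the model pick the panel, step 2 instantiates each persona.
# A sketch only -- the SPP paper's actual prompt templates differ from this.
from openai import OpenAI

client = OpenAI()

def complete(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

question = "Critique this methods paper on remote patient monitoring before I submit it: ..."

# Step 1: ask for the panel instead of pre-specifying it.
panel = complete(
    "List 4 to 6 reviewer personas, one per line, who would give the most revealing "
    "critiques of the task below. Include at least one perspective the author "
    "probably hasn't considered.",
    question,
)

# Step 2: run each persona independently and collect the critiques.
for line in panel.splitlines():
    persona = line.strip(" -*.0123456789")  # strip list bullets and numbering
    if not persona:
        continue
    critique = complete(f"You are {persona}. Give your most forceful, specific critique.", question)
    print(f"=== {persona} ===\n{critique}\n")
```

The interesting output is usually step one itself — the panel the model proposes often contains the reviewer you didn't know to fear.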
Alright, let's talk about the organizational pre-mortem use case, because I think this is the one with the most immediate business application and the one that's most different from what standard pre-mortems do.
Standard pre-mortems ask: "Imagine it's a year from now and this project failed. Why?" Which is useful but limited. The multi-persona version goes further by simulating the specific people who will be affected — each with a different vantage point on the same failure.
So you're not asking "what went wrong?" in the abstract. You're asking the customer who churned: what was the last straw? You're asking the employee who quit: what did you see that leadership didn't? You're asking the journalist who wrote the critical story: what angle did you take? The regulator who opened an investigation: what triggered it?
And each of those personas has access to a different slice of information. The customer who churned saw the product experience. The employee who quit saw the internal culture. The journalist saw the public-facing narrative and the gap between it and reality. The regulator saw the compliance record. No single one of them has the whole picture, and that's the point — the synthesis across all of them reveals systemic vulnerabilities that a single "what went wrong?" question would completely miss.
The thing I like about this framing is that it forces you to think about failure modes as relational, not just operational. It's not just "the product didn't work." It's "the product didn't work and here's the specific person who will tell that story and here's what they'll say."
And you can make this very specific. If you're launching a health tech product, your personas might be: an FDA reviewer focused on clinical validation claims, an FTC investigator focused on advertising practices, an EU data protection officer applying GDPR, a class action plaintiff's attorney identifying the lawsuit, and a congressional oversight staffer preparing hearing questions. That's a regulatory pre-mortem, not just a product pre-mortem. You're anticipating not just failure but the institutional response to failure.
Which is actually where a lot of companies get blindsided. The product works fine. The regulatory response is what kills them.
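Before we move on — for anyone who wants to actually run that kind of regulatory pre-mortem, the panel is really just a table: a persona, plus the specific question that persona is positioned to answer about the same imagined failure. Here's a sketch; the personas, questions, and model name are illustrative placeholders, and you'd adapt the premise to your own product and jurisdictions.

```python
# A regulatory pre-mortem panel as data: each persona answers a targeted question
# about the same imagined failure. Personas, questions, and the model name are
# illustrative placeholders -- adapt them to your own product and jurisdictions.
from openai import OpenAI

client = OpenAI()

PREMISE = ("It is eighteen months after the launch of our consumer health-monitoring "
           "app, and the launch has gone badly wrong.")

PANEL = [
    ("an FDA reviewer focused on clinical validation claims",
     "Which specific claims triggered your scrutiny, and what evidence was missing?"),
    ("an FTC investigator focused on advertising practices",
     "What in our marketing did you consider deceptive, and why?"),
    ("an EU data protection officer applying GDPR",
     "Which data flows created the exposure, and what should we have documented?"),
    ("a class action plaintiff's attorney",
     "What is the theory of the lawsuit you filed, and who is your lead plaintiff?"),
    ("a congressional oversight staffer preparing hearing questions",
     "What three questions do you most want our CEO to answer on the record?"),
]

for persona, targeted_question in PANEL:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system", "content": f"You are {persona}. Stay strictly in that institutional role."},
            {"role": "user", "content": f"{PREMISE} {targeted_question}"},
        ],
    )
    print(f"=== {persona} ===\n{resp.choices[0].message.content}\n")
```

Each answer is only one slice of the picture, which is the point — the synthesis across them is where the systemic vulnerabilities show up.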
There's a whole distinct use case here for curriculum design that I want to get to, because I think it's genuinely underexplored. The standard approach to course design involves some amount of user research — maybe a survey, maybe a focus group. The multi-persona version lets you run your curriculum through learner types before you've built it.
So you design the course, and then you simulate: the expert student who already knows eighty percent of this and will be bored. The complete novice who has no prior knowledge and will be lost. The skeptical practitioner who thinks theory is useless and wants only applied examples. The anxious learner who needs more scaffolding. The learner with attention challenges who needs shorter chunks and more varied formats.
And this isn't market research — you're not asking whether people will buy the course. You're asking whether the learning experience works. Those are completely different questions. The expert student might say "I'd buy this, but I'd drop out in week two because you're spending three sessions on fundamentals I already know." That's a pedagogical gap, not a marketing gap.
And the personas can be calibrated to specific learner profiles you actually know you'll be serving. If you're designing technical training for a company, you can define personas based on the actual job roles and experience levels of the people who'll take it.
Let me bring in the counterintuitive finding from Rally's research here, because I think it's important context for all of these use cases. Vanderbilt University ran four thousand-plus question-answering tasks with single-agent and multi-agent setups, and the finding was: the more collaborative you make your AI audience — adding roundtable debates, multi-agent panels — the less accurate and realistic the answers become for factual questions. Single, well-chosen expert personas outperform fancy group simulations for closed, factual questions.
So the sophistication of the setup isn't the variable that matters. It's the openness of the question.
That's the key insight. Personas shine on open-ended, interpretive, advice-style questions. For factual or multiple-choice questions, they add noise. Which means the technology is being systematically misapplied — over-applied to closed questions where it adds nothing, and under-applied to genuinely open questions where it adds real value. Questions like "how should I think about this?" or "what am I missing?" or "what would change my mind?" — those are where the architecture earns its keep.
Which is a useful corrective to the hype. This isn't a general-purpose improvement to AI reasoning. It's a specific tool for specific question types.
And when you apply it to the right question type, the results can be striking. PersonaTeaming, which was at NeurIPS twenty twenty-five, showed that using personas in AI red-teaming improved adversarial prompt attack success rates by up to a hundred and forty-four percent compared to state-of-the-art methods. The system dynamically generates "red-teaming expert" and "regular AI user" personas adaptive to different seed prompts. That's not a marginal improvement — that's a qualitative change in capability.
I want to talk about the theological and ethical exploration use case, because I think it's one of the most intellectually honest applications of this technology. The question "is it ethical to use AI to simulate deceased loved ones?" is a genuinely hard ethical question. And the multi-persona approach to it is: run it through a utilitarian, a Kantian deontologist, a virtue ethicist, a Buddhist framework, a Jewish halakhic framework, a secular humanist. Not to get a verdict. To surface the actual tensions.
And the tensions are real and they're different from different vantage points. The utilitarian is asking about aggregate wellbeing — does this help the grieving person? At what cost to what? The Kantian is asking about universalizability — if everyone did this, what world would we be living in? Does it treat the deceased as an end or as a means to our emotional comfort? The virtue ethicist is asking what kind of person the practice makes you — does it cultivate acceptance of loss or does it prevent it?
And the Buddhist and the Kantian might actually agree on the conclusion — this is a problematic relationship to impermanence — while completely disagreeing on the reasoning. Which is itself an interesting finding.
The multi-persona format makes that visible. You see not just what each tradition concludes but where the traditions cluster and where they diverge. And sometimes you find that traditions you expected to agree are in tension, and traditions you expected to be in conflict are actually aligned for different reasons.
There's a use case in here that I want to flag as the one I think is the most personally useful and also the least technologically sophisticated: what I'd call the dialogical self for personal decision-making. Bakhtin's concept — that human thought is fundamentally an internalized conversation between different perspectives — suggests a very simple but powerful application. Simulate the voices of people who matter to you in a decision.
Your future self, ten years from now, looking back at this decision. Your harshest critic — the person whose disapproval you most fear. Your most supportive mentor — what would they encourage? The person most affected by your decision who isn't in the room.
And this is explicitly not grief tech. You're not simulating the deceased. You're not trying to recreate someone. You're using the persona architecture to externalize an internal dialogue that you're already having anyway — just making it legible and structured.
Most people, when facing a hard decision, do something like this informally. They imagine what their mentor would say. They imagine the harshest possible critique. The difference is that doing it explicitly in a structured system means you actually hear the uncomfortable answer rather than the version your motivated reasoning produces. Your mental model of your harshest critic will unconsciously soften the critique. An AI persona given explicit instructions to be your harshest critic won't.
That's the uncomfortable part. It might actually be more honest than the internal version.
And the research supports this. The SPP paper found that the model generates perspectives the user hadn't considered — the whole point of dynamic persona generation is that you're getting genuine expansion of the perspective space, not just elaboration of what you already knew.
Okay, I want to spend a moment on what I think is the most important cautionary note in this whole space, and it's something that's been called "algorithmic bias laundering." The failure mode where you run a multi-persona simulation, get results that confirm what you already wanted to do, and then cite the "systematic analysis" as validation.
This is arguably worse than not using the tool at all. Because you've added a veneer of rigor to a predetermined conclusion. The AI system becomes a mechanism for hiding human biases behind the appearance of objective analysis. "We ran a comprehensive multi-stakeholder simulation and it supported our approach" — when actually you designed the personas to be mild skeptics rather than genuine adversaries, and you got the result you were looking for.
The antidote is deliberate adversarial design. You have to include personas that are genuine opponents of your position, not just personas that will ask politely skeptical questions. If you're evaluating a product launch and all your personas are "enthusiastic early adopter," "thoughtful mainstream user," and "slightly cautious late adopter," you haven't stress-tested anything. You need "the person who actively doesn't want this to exist and will tell you why."
And there's a structural version of this problem too. LLM Council addresses it by explicitly including "dissenting views" as a required section in the final consensus document — the system is designed so that consensus can't just steamroll minority perspectives. The dissenting view has to be documented even if it didn't carry the day. That's a good design principle: make it architecturally harder to lose the uncomfortable perspective.
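And you can bake that principle directly into the synthesis step. Here's a small sketch — not LLM Council's actual moderator prompt, just the design idea expressed as a prompt — where the final document structurally cannot omit the dissenting views.

```python
# The 'dissent can't be steamrolled' principle, expressed as a synthesis prompt.
# Not LLM Council's actual moderator prompt -- a sketch of the design idea only.
from openai import OpenAI

client = OpenAI()

MODERATOR_PROMPT = """You are a neutral moderator. Given several persona proposals, write a final document with exactly these sections:
1. Consensus: the points most personas agree on.
2. Dissenting views: every substantive objection that did NOT carry the day, attributed to the persona that raised it. This section may not be empty if any disagreement exists.
3. Open questions: disagreements that cannot be resolved from the material given."""

def synthesize(proposals):
    """proposals: dict mapping persona name -> that persona's proposal text."""
    material = "\n\n".join(f"## {name}\n{text}" for name, text in proposals.items())
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "system", "content": MODERATOR_PROMPT},
                  {"role": "user", "content": material}],
    )
    return resp.choices[0].message.content
```

The required "dissenting views" section is doing the same work the Council's architecture does: it makes losing the uncomfortable perspective the harder path.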
There's also the temperature variable, which I think is underappreciated as a design tool. LLM Council assigns different temperatures per persona — the Architect runs at zero point eight, more creative, the Critic at zero point five, more focused. That's not just a stylistic choice. Higher temperature means more variance in outputs, which means a high-temperature creative persona and a low-temperature analytical persona will produce meaningfully different outputs even from the same underlying model.
And combined with the per-agent model diversity — running different models per persona — you're stacking two sources of genuine diversity. Prompt-level diversity from the system prompts, model-level diversity from different underlying models, and temperature-level diversity from different sampling parameters. Each layer adds authentic disagreement rather than just rhetorical variety.
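And if you want to see what stacking those layers looks like in practice, it's basically a config table: per persona, a system prompt, a model, and a temperature. The values below just mirror the Architect-at-0.8, Critic-at-0.5 example — they're illustrative, and routing personas to genuinely different providers would need a client per provider rather than the single OpenAI client used here.

```python
# Stacking the diversity layers: per-persona system prompt, model, and temperature.
# Values are illustrative (mirroring the Architect 0.8 / Critic 0.5 example).
# Real cross-provider diversity would need one client per provider.
from openai import OpenAI

client = OpenAI()

PERSONA_CONFIG = {
    "Architect": {
        "model": "gpt-4o",        # placeholder model name
        "temperature": 0.8,       # higher temperature -> more output variance
        "system": "You are the Architect. Propose bold structural options.",
    },
    "Critic": {
        "model": "gpt-4o-mini",   # placeholder for a different underlying model
        "temperature": 0.5,       # lower temperature -> more focused critique
        "system": "You are the Critic. Find concrete flaws in every proposal.",
    },
}

def run_persona(name: str, question: str) -> str:
    cfg = PERSONA_CONFIG[name]
    resp = client.chat.completions.create(
        model=cfg["model"],
        temperature=cfg["temperature"],
        messages=[{"role": "system", "content": cfg["system"]},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

Three knobs per persona, three layers of disagreement.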
Let me bring up the cross-cultural message testing use case because I think it connects to something Daniel is probably particularly attuned to, given his work in communications. This is different from Rally's demographic focus group approach. It's specifically about how a message lands across different cultural frameworks for meaning-making.
Hall's high-context versus low-context communication framework. Hofstede's individualist versus collectivist dimensions. Different relationships to authority and hierarchy. Different temporal orientations — short-term versus long-term thinking. If you're launching a product globally, or writing a policy communication that needs to work across cultural contexts, you're not just asking "will different demographics prefer this?" You're asking "does this message mean the same thing in different cultural interpretive frameworks?"
And those are often very different questions. A message that reads as confident and direct in a low-context communication culture reads as aggressive and dismissive in a high-context one. A frame that emphasizes individual benefit in an individualist culture lands completely differently in a collectivist one. You can know this abstractly and still miss it when you're writing.
The persona architecture lets you stress-test the specific message against specific cultural frameworks before it goes out. And the failure mode it prevents is the one where you've thought carefully about your message but only through the interpretive framework you grew up in.
I want to come back to the WarAgent research briefly because I think it illustrates something important about the scale at which this technique can operate. Rutgers and University of Michigan built a multi-agent system that simulates World War One, World War Two, and the Warring States period in ancient China. Each participating country is an agent. And the researchers are asking: "Can we avoid wars at the crossroads of history?"
What makes that interesting isn't just the historical simulation. It's the "emergent interactions" — the researchers are finding dynamics that arise from the agent interactions that weren't built into any individual agent's behavior. You get systemic behavior emerging from the collision of individual agent decisions. And that's a genuinely different kind of analytical tool than asking "what were the causes of World War One?" You're running the system and seeing what falls out.
Which connects back to the investment philosophy use case, actually. If you're asking "should I rebalance toward bonds?" and you run it through a value investor, a macro trader, a risk parity advocate, an efficient market believer, and a behavioral finance lens — you're not just getting five opinions. You're getting a map of the actual disagreements in investment philosophy, and you can see which disagreements are empirical and which are about values and risk tolerance.
There's a paper — "Do LLM Personas Dream of Bull Markets?" — that explicitly tested whether LLM personas with specific Big Five personality profiles perform investment tasks similarly to humans with those profiles. And they found meaningful correlations. So the persona architecture isn't just producing rhetorically different framings of the same underlying analysis. It's producing meaningfully different analytical outputs, in ways that track how actual humans with those personality profiles think about investment problems.
Which raises the question of what you do with that. If the risk parity advocate and the efficient market believer are giving you genuinely different advice, and both are grounded in coherent internal frameworks, the synthesis isn't "average them." The synthesis is "understand what assumption would have to be true for each to be right, and figure out which of those assumptions you actually hold."
That's the meta-skill this whole technology is training, I think. Not "let AI make the decision." But "use AI to map the decision space clearly enough that you can actually see what you're choosing between." Most hard decisions feel hard because the options are fuzzy. Multi-persona exploration makes the options crisper and the tradeoffs more legible.
And that connects back to the SPP finding that dynamic persona generation beats pre-set personas. Because if you let the model identify which perspectives are relevant to your specific question, you're getting a map of the decision space that's shaped by the question rather than by your prior assumptions about what matters.
The practical implication for anyone building these systems: start with the question, not the personas. Ask the model what kinds of expertise or perspective would be most illuminating for this specific question. Then instantiate those personas. You'll surface angles you didn't know to ask about.
Alright, let's bring this to a close. What's the actual takeaway for someone who wants to start using this seriously?
A few things. First, match the technique to the question type. Open-ended, interpretive, advice-style questions — this is where the architecture earns its keep. Closed factual questions — skip it, it adds noise. Second, design your personas to include genuine adversaries. Not mild skeptics. People who actively don't want you to be right. Third, if you're using an open source system like LLM Council, take advantage of the per-agent model diversity. Run different models per persona if you can. You'll get more authentic disagreement.
And fourth — I'd add — pay attention to what the personas disagree about. The disagreements are usually more informative than the individual positions. If two of your personas are pulling in opposite directions, that's a signal that there's a genuine tension in the question, not just a preference difference. Understanding what's actually in tension is usually the most valuable output of the whole exercise.
The meta-insight from the "Societies of Thought" paper is worth sitting with: when you do this deliberately, you're not doing something exotic. You're externalizing what the best reasoning models already do internally. The question is whether you want that process to be visible and steerable, or hidden and automatic. Making it explicit gives you control. It lets you weight perspectives, introduce new ones, force the uncomfortable view to be heard. That's not a small thing.
The technology is simple. A system prompt per persona, a question, a synthesis. The sophistication is in knowing what to ask, who to ask it to, and what to do with the disagreements when you get them.
And that sophistication is genuinely learnable. It's more a design skill than a technical skill.
Which is maybe the most encouraging thing about this whole space. You don't need a fancy pipeline to start. You need a good question and a willingness to build personas that will genuinely challenge you.
That's the part people skip. They build the personas that will confirm them. Build the ones that won't.
Thanks as always to our producer Hilbert Flumingtop. Big thanks to Modal for the GPU credits that keep this whole operation running. This has been My Weird Prompts — find us at myweirdprompts dot com for RSS and all the ways to subscribe. We'll see you next time.
Take care.