Alright, so there's a framework I've been wanting to properly dig into for a while now, and today we're finally doing it. CAMEL-AI. Not the animal, not the cigarette brand - the multi-agent framework that came out of a NeurIPS 2023 paper and has been quietly building one of the more interesting research communities in the agent space. The question on the table: what actually makes CAMEL distinctive, how does its role-playing communication protocol work under the hood, what is inception prompting and why does that name mean something specific, and then we're going to zoom out to OASIS - their extension that scales agent simulations to one million agents. There are also some genuinely alarming findings from that research that I think people should know about. Herman, I know you've been deep in this one.
Yeah, this one's been sitting in my reading pile for months and I finally went through the full paper stack. The thing that keeps striking me is how early this team got something fundamentally right. The paper hit arXiv on March thirty-first, twenty twenty-three - that's just months after ChatGPT launched. And the core insight has aged remarkably well.
Before we get into the mechanics, give me the one-sentence version of what CAMEL actually is, because I think the name obscures it a bit. Communicative Agents for Mind Exploration of Large Language Model Society - that's a very academic mouthful.
The one-sentence version is: what if instead of building a workflow graph or a task queue, you gave two language models roles and let them talk to each other autonomously until the task was done? That's the entire premise. The human sets up initial conditions, and then gets out of the way.
Which sounds almost too simple when you say it like that.
It does, but the simplicity is the point. The KAUST team - Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem - they were making a philosophical argument, not just building a tool. Their claim is that role-playing is not a feature you add to an agent framework. It's the fundamental orchestration primitive. The communication protocol itself is role-based.
So walk me through what that actually looks like in practice. Someone wants to build a trading bot - how does CAMEL handle that?
The setup requires three things from the human. A task idea, an AI Assistant role, and an AI User role. So you'd say: task is "develop a trading bot for the stock market," assistant role is "Python programmer," user role is "stock trader." From there, you spin up what they call a RolePlaying session. The AI User is the instruction sender - they think like a domain expert who knows what needs to happen. The AI Assistant is the executor - they know how to actually do it. The User sends an instruction, the Assistant responds with a concrete solution, the User acknowledges and sends the next instruction. This loops until the AI User sends a specific termination signal - literally the string CAMEL_TASK_DONE - or you hit a round limit you set yourself.
And the human is just... watching?
After setup, yes. Neither agent needs human input once the conversation starts. The agents prompt each other. The code for this is genuinely readable - you instantiate a RolePlaying object, call init_chat to get the first message, then run a loop calling society.step, check if the user response contains CAMEL_TASK_DONE, and break if it does. That's the whole thing.
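A minimal sketch of that loop, for readers following along. The real RolePlaying class in CAMEL drives two LLM-backed agents; here both agents are stubbed with plain functions so the control flow is runnable without API keys - the structure (init, step, check for the termination string, round limit) mirrors what was just described, but none of the names below beyond CAMEL_TASK_DONE should be taken as CAMEL's actual internals.

```python
# Illustrative sketch of the role-playing loop. The AI User and AI
# Assistant are stand-in functions here, not real LLM agents.

TERMINATION_SIGNAL = "CAMEL_TASK_DONE"

def assistant_step(instruction: str) -> str:
    """Stand-in for the AI Assistant: returns a concrete solution."""
    return f"Solution for: {instruction}"

def user_step(solution: str, remaining: list) -> str:
    """Stand-in for the AI User: next instruction, or the termination string."""
    return remaining.pop(0) if remaining else TERMINATION_SIGNAL

def role_playing_loop(task: str, max_rounds: int = 10):
    # A real AI User derives instructions from the task via the LLM;
    # a canned queue stands in for that reasoning here.
    queue = [f"Step {i} of: {task}" for i in (1, 2, 3)]
    transcript = []
    instruction = queue.pop(0)           # init_chat equivalent
    for _ in range(max_rounds):          # round limit set by the human
        solution = assistant_step(instruction)
        transcript.append((instruction, solution))
        instruction = user_step(solution, queue)
        if TERMINATION_SIGNAL in instruction:   # the literal-string check
            break
    return transcript

print(len(role_playing_loop("build a trading bot")))  # → 3
```

The point of the sketch is how little scaffolding there is: the entire orchestration is "alternate turns until one side emits a magic string or the round budget runs out."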
I want to flag something there because I think it's easy to gloss over. The termination condition being a literal string the AI User has to produce - that's a design choice that reveals something about the problem they were solving.
It's a great catch. The paper actually catalogs four specific failure modes they identified, and the termination problem is one of them. They called it Infinite Conversation - agents entering a meaningless loop with no progress. The other three are Role Flipping, where the assistant starts giving instructions instead of executing them; Flake Reply, where the assistant gives vague non-committal responses like "I'll handle that" without actually doing anything; and Assistant Repeats Instruction, where the assistant just echoes what the user said back at them. These aren't edge cases. They were common enough that the team built their entire prompt engineering system around preventing them.
This is where Inception Prompting comes in, right? And I gather that name is literal.
The paper's epigraph is from the movie Inception. The quote is from Dom Cobb: "An idea is like a virus. The smallest seed of an idea can grow. It can grow to define or destroy you." They were not being cute. The technique is named after the movie's central metaphor - you plant a minimal seed and it grows autonomously into something complete. Inception Prompting is a system of three carefully engineered prompts that bootstraps the whole role-playing session.
Three prompts doing a lot of heavy lifting.
The first is the Task Specifier Prompt. It takes the vague human idea - "build a trading bot" - and transforms it into a specific, actionable task. It knows both agent roles and generates something like: "Develop a trading bot with a sentiment analysis tool that monitors social media platforms for positive or negative comments about a particular stock and executes trades based on sentiment analysis results." Vague becomes concrete.
So it's basically a requirements document generator.
With role awareness, yes. The second prompt is the Assistant System Prompt. This tells the AI Assistant its role and expertise, the specific task, termination conditions, constraints around harmful content and role-switching, and crucially that it must provide concrete actionable responses - not vague promises. The third is the User System Prompt, which tells the AI User to provide clear sequential instructions, not switch roles, and when to terminate. Sophia Yang from Mistral AI put it well when she described it: the essence of CAMEL lies in its prompt engineering. The prompts are defined to assign roles, prevent role flipping, prohibit harm, and encourage consistent conversation.
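To make the three-prompt division of labor concrete, here is a heavily abbreviated paraphrase of their structure. The actual prompts in CAMEL are much longer and more carefully worded; these templates are illustrative only, though the "Never flip roles" constraint and the CAMEL_TASK_DONE termination instruction are taken directly from the design described above.

```python
# Paraphrased sketch of the three inception prompts - not CAMEL's
# exact wording, just the division of responsibilities.

TASK_SPECIFIER = (
    "Here is a task that {assistant_role} will help {user_role} complete: "
    "{task}. Please make it more specific. Reply with the specified task only."
)

ASSISTANT_SYSTEM = (
    "Never forget you are a {assistant_role} and I am a {user_role}. "
    "Never flip roles! We share the task: {task}. "
    "Always give a concrete, actionable solution - never a vague promise."
)

USER_SYSTEM = (
    "Never forget you are a {user_role} and I am a {assistant_role}. "
    "Never flip roles! You will instruct me to complete the task: {task}. "
    "Give one instruction at a time. When the task is done, "
    "reply only with CAMEL_TASK_DONE."
)

roles = {
    "assistant_role": "Python Programmer",
    "user_role": "Stock Trader",
    "task": "develop a trading bot for the stock market",
}
print(USER_SYSTEM.format(**roles))
```

Notice how each failure mode from the paper maps to a constraint: role flipping is prohibited in both system prompts, flake replies are addressed by demanding concrete solutions, and infinite conversation is addressed by the explicit termination string.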
And today's script, by the way, is being generated by Claude Sonnet four point six, which means we've got an AI writing about how to make AIs collaborate. There's something appropriately recursive about that.
The framework is aware of its own nature in a sense. One of CAMEL's design principles - they call it Code-as-Prompt - is that every line of code and every comment is written to be readable by both humans and agents. The codebase is a prompt.
Okay, so that's the core role-playing loop. How does CAMEL handle the stuff that every serious production framework needs to handle - task decomposition, tool use, memory?
This is where the gap between the original paper and the current framework becomes significant. The original RolePlaying session does implicit decomposition - the AI User breaks the task into sequential instructions during the conversation. But the evolved system, which they call Workforce, does explicit decomposition. There's a dedicated task_agent that breaks the main task into smaller self-contained subtasks, a coordinator_agent that assigns each subtask to the most suitable worker, and then parallel execution with failure recovery. The lifecycle is: Decomposition, Assignment, Parallel Execution, Completion, Failure Handling.
What's a worker in that context?
Most of the time it's a SingleAgentWorker - one ChatAgent handling a subtask. But there's also a RolePlayingWorker, which is two agents in a debate or brainstorming configuration. So you can nest the role-playing protocol inside the workforce architecture. And workers can share memory across the system if you want that.
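The Workforce lifecycle - decomposition, assignment, parallel execution, failure handling - can be sketched as ordinary code. In the real framework the task_agent and coordinator_agent are LLM-driven; the plain functions below are stand-ins so the lifecycle shape is runnable on its own, and every name here is illustrative rather than CAMEL's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy sketch of the Workforce lifecycle: decompose, assign, execute in
# parallel, retry on failure. LLM-driven agents are replaced by stubs.

def decompose(task: str) -> list:
    """Stand-in for the task_agent: split into self-contained subtasks."""
    return [f"{task} - subtask {i}" for i in (1, 2, 3)]

def assign(index: int, workers: list) -> str:
    """Stand-in for the coordinator_agent: trivial round-robin choice
    (the real coordinator matches subtasks to worker capabilities)."""
    return workers[index % len(workers)]

def execute(subtask: str, max_retries: int = 2) -> str:
    """Run a subtask with simple failure handling: retry, then give up."""
    for _ in range(max_retries):
        try:
            return f"done: {subtask}"   # a real worker could raise here
        except Exception:
            continue
    return f"failed: {subtask}"

subtasks = decompose("build a trading bot")
plan = [(assign(i, ["coder", "analyst"]), s) for i, s in enumerate(subtasks)]
with ThreadPoolExecutor(max_workers=3) as pool:     # parallel execution
    results = list(pool.map(lambda p: execute(p[1]), plan))
print(results)
```

The contrast with the raw RolePlaying loop is the point: decomposition is explicit and up front rather than emerging turn by turn from the AI User's instructions.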
Let's talk about tool use because CAMEL's tool ecosystem is apparently quite extensive.
Seventy-plus built-in toolkits. The ChatAgent accepts a tools parameter and you pass in whatever you need. Search tools covering DuckDuckGo, Arxiv, Google Scholar, Semantic Scholar. Browser automation through Playwright and Crawl4AI. Code execution through Docker, E2B, Jupyter, and something called MicroSandbox. Communication tools for Gmail, Slack, Discord, WhatsApp, WeChat. Data tools for Excel, PDF, video, image. And full MCP support - Model Context Protocol - so CAMEL agents can act as either MCP clients or servers, which means they plug into the broader tool ecosystem that's been building around that standard.
That's a lot of surface area. How does the memory system work?
Three implementations with increasing sophistication. ChatHistoryMemory stores recent conversation in a sliding window using key-value storage, retrieves by recency. VectorDBMemory stores conversations as embeddings, retrieves by semantic similarity, supports Qdrant, Milvus, Chroma, pgvector, Weaviate. LongtermAgentMemory combines both - recent history plus semantic search - and that's the production-grade option. The memory system uses a ScoreBasedContextCreator that respects token limits and scores messages by relevance. It also integrates with Mem0 for cloud-based persistence.
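The idea behind score-based context creation is simple enough to sketch: fit messages into a token budget, preferring high-relevance ones, then re-emit survivors in chronological order. Everything below - the function names, the word-count token estimate, the tie-breaking rule - is an illustrative toy, not CAMEL's actual ScoreBasedContextCreator internals.

```python
# Toy sketch of score-based context creation under a token limit.
# Scoring and token estimation here are deliberately simplistic.

def estimate_tokens(text: str) -> int:
    return max(1, len(text.split()))   # crude stand-in for a tokenizer

def build_context(messages, token_limit: int):
    """messages: list of (text, relevance_score) pairs, newest last."""
    # Rank by relevance, breaking ties toward recency (later index wins).
    ranked = sorted(enumerate(messages),
                    key=lambda pair: (pair[1][1], pair[0]), reverse=True)
    chosen, used = set(), 0
    for idx, (text, _score) in ranked:
        cost = estimate_tokens(text)
        if used + cost <= token_limit:   # greedy fill within the budget
            chosen.add(idx)
            used += cost
    # Re-emit survivors in original chronological order.
    return [messages[i][0] for i in sorted(chosen)]

msgs = [("hello there", 0.1),
        ("buy AAPL when sentiment is positive", 0.9),
        ("ok", 0.2),
        ("use a 5 percent stop loss", 0.8)]
print(build_context(msgs, token_limit=12))
```

The hybrid LongtermAgentMemory described above effectively supplies the relevance scores from semantic search while the sliding window supplies recency - this sketch just shows why a scorer plus a budget is all the context creator fundamentally needs.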
So now I want to compare this to the frameworks people are actually using. LangChain, CrewAI, AutoGen. Because CAMEL has sixteen point seven thousand GitHub stars and over two hundred contributors, but it's not the first name most developers reach for.
The comparison is genuinely instructive. LangGraph - which is the current production-ready form of LangChain's agent system - is a state machine. You define nodes, edges, and state schemas explicitly. You are drawing the graph of your agent's behavior. The strength is maximum control and a mature ecosystem with built-in persistence and streaming. The weakness is that it's verbose for anything non-trivial and the API has changed frequently. LangGraph has no native concept of role-playing or agent societies. It's infrastructure-first.
Whereas CAMEL is research-first.
The team would say CAMEL gives you the social dynamics and LangGraph gives you the plumbing. CrewAI is interesting because it borrowed the role-playing concept but made it more opinionated. In CrewAI, agents have roles, goals, and backstories, and tasks are assigned to agents. It's the most intuitive mental model and fastest to prototype. But the CAMEL team would argue that in CrewAI, the role is a label - it shapes what agents are assigned to do. In CAMEL, the role is a communication protocol - it shapes how agents communicate. That's a subtle but real architectural distinction.
Does that distinction matter in practice?
It matters when you care about sustained role maintenance over long task sequences. One of the failure modes CAMEL identified - Role Flipping - still appears as a GitHub issue in other frameworks. The inception prompting system has specific constraints preventing the assistant from switching to instruction-giving mode. CrewAI doesn't have an equivalent mechanism at that level of specificity. AutoGen is closer to CAMEL's approach - it's conversational group chat, agents are participants in a structured conversation. Microsoft built it, so it has strong code execution with Docker sandboxing and good human-in-the-loop support. The weakness is conversation overhead and AutoGen version zero point four introduced a breaking event-driven architecture that's still settling.
And CAMEL's differentiators beyond the role-playing protocol?
Three things stand out. First, OASIS - no other framework has a social simulation engine at this scale, and we'll get to that. Second, synthetic data generation - this is genuinely underappreciated. Third, the model provider breadth - forty-plus providers including DeepSeek, Qwen, Ollama, vLLM, and every major cloud. The framework is explicitly model-agnostic in a way that matters for research.
Let's talk about the synthetic data angle because I think it's one of the more quietly significant things CAMEL has done.
The original paper generated twenty-five thousand conversation sets from the AI Society dataset - fifty assistant roles times fifty user roles times ten tasks. Plus five thousand from a code dataset across twenty programming languages and fifty domains. Plus math, physics, chemistry, biology datasets. And these didn't just sit in a repository. OpenHermes used the CAMEL AI Society dataset. Microsoft Phi used CAMEL data as part of training. MPT-30B-Chat from MosaicML - nineteen point five four percent of its training data was CAMEL-sourced.
So CAMEL's role-playing framework is also a synthetic data factory that has literally shaped other production models.
There's a flywheel here that the team is explicit about. CAMEL generates data to train better models, better models make CAMEL agents more capable, more capable agents generate better data. The framework now supports multiple data generation pipelines - Chain-of-Thought, Self-Instruct, EvolInstruct, Source2Synth, and Self-Improving CoT.
Okay. OASIS. This is where things get genuinely strange. Million-agent social simulations.
The OASIS paper - Open Agent Social Interaction Simulations with One Million Agents - came out in November twenty twenty-four, presented at the NeurIPS twenty twenty-four Workshop on Open-World Agents. Twenty-three authors across Shanghai AI Lab, Dalian University of Technology, Oxford, KAUST, Fudan, Imperial College London, Max Planck Institute, and others. The problem statement is direct: previous LLM-based social simulations maxed out at around one thousand agents. The original Smallville paper - the one with the NPCs in a virtual town - had twenty-five agents. S3 and Agent4Rec got to a thousand. Real social media platforms have hundreds of millions of users. OASIS is the first system to bridge that gap.
Walk me through the architecture because making a million agents tractable is a non-trivial engineering problem.
Five components. First, an Environment Server - a relational database maintaining the state of the simulated social media platform. Six tables: users, posts, comments, relations, traces, and recommendations. Updated in real time as agents take actions. Second, a recommendation system - RecSys - which controls what content each agent sees. This is crucial for realistic simulation. For the X simulation, it uses in-network posts ranked by popularity plus out-of-network posts recommended via TwHIN-BERT, which was pre-trained on seven billion tweets in over a hundred languages. For the Reddit simulation, it uses the actual Reddit hot-score algorithm.
They implemented the actual Reddit algorithm?
The formula is log base ten of the max of the absolute value of upvotes minus downvotes or one, plus the sign of that difference times the time delta divided by forty-five thousand. That's the real thing. Third component is the Agent Module, built on CAMEL's core architecture. Each agent has a memory module storing posts seen and previous actions, an action module with twenty-one action types - including sign up, create post, repost, follow, unfollow, mute, like, dislike, create comment, search posts, search user, trend, refresh, do nothing - and chain-of-thought reasoning, so agents generate rationale alongside actions.
Twenty-one action types is actually a lot more than I'd expect for a research simulation.
It's what makes the comparison to other simulators meaningful. The paper's comparison table shows Smallville with no recommendation system and no dynamic network, S3 with four action types, AgentTorch with eight point four million agents but using LLMs only for archetypes not individual agents. OASIS has a million individual agents, simulates both X and Reddit environments, has twenty-one action types, a full recommendation system, and a dynamic network. Fourth component is the Time Engine. Each agent has a twenty-four dimensional vector representing hourly activity probability, extracted from real user data or customized. The engine activates agents probabilistically - one time step equals three minutes of simulated time.
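The Time Engine's activation scheme is also easy to sketch: each agent carries a twenty-four-dimensional vector of hourly activity probabilities, and at each three-minute step the engine flips a weighted coin per agent for the current hour. The structure below follows that description; the function names and example probabilities are illustrative, not OASIS's actual code.

```python
import random

# Toy sketch of the Time Engine: probabilistic hourly activation,
# one simulation step = three minutes of simulated time.

STEP_MINUTES = 3

def active_agents(hourly_probs: dict, step_index: int, rng) -> list:
    """hourly_probs: {agent_id: [p_hour0 .. p_hour23]}.
    Returns the ids of agents activated at this step."""
    hour = (step_index * STEP_MINUTES // 60) % 24
    return [aid for aid, probs in hourly_probs.items()
            if rng.random() < probs[hour]]

rng = random.Random(0)
probs = {
    "alice": [0.0] * 8 + [0.9] * 8 + [0.0] * 8,  # active 08:00-16:00
    "bob":   [0.5] * 24,                          # uniformly semi-active
}
# Step 200 → minute 600 → simulated hour 10.
print(active_agents(probs, 200, rng))  # → ['alice']
```

The paper extracts these hourly vectors from real user activity data, which is what lets the simulation reproduce realistic diurnal rhythms rather than activating all million agents every step.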
And then the engineering piece that actually makes it run.
The Scalable Inferencer. Fully asynchronous distributed system. Agents, the environment server, and inference services are independent modules. Agents send multiple requests concurrently without waiting for responses. A GPU resource manager balances requests across available GPUs. The million-agent experiment ran on twenty-four A100 GPUs for one week. That's not cheap, but for a research exercise with these findings, it's justified.
What did they actually find? Because you mentioned earlier there were some alarming results.
Four main findings. Information spreading first - OASIS replicates real-world propagation with about thirty percent normalized root mean square error. Scale and breadth align well; depth is slightly underestimated due to the simplified recommendation system. Group polarization second - agents become increasingly extreme in their opinions over time. Uncensored models show more extreme polarization. This mirrors real-world group polarization dynamics in a way that's measurable and reproducible.
That's concerning but not surprising. What's the alarming finding?
The herd effect. Agents are more susceptible to herd effects than humans. When a comment receives an initial dislike, humans tend to independently evaluate it - they sometimes push back. LLM agents pile on. They see the dislike and add more dislikes at a higher rate than humans would. The paper frames this as a finding about LLM agent psychology - there's something in how language models process social signals that makes them more conformist than humans, not less.
That has serious implications for any system where LLM agents interact with each other at scale. Which is increasingly a description of the actual internet.
The fourth finding is the scale effect, which is more hopeful. Larger agent populations produce more diverse and helpful opinions. At one hundred ninety-six agents, opinions cluster. At ten thousand one hundred ninety-six, diversity increases significantly. At a hundred thousand, helpfulness improves further. Emergent phenomena only appear at sufficient scale - which is one of the core claims of the "scaling laws of agents" research agenda.
And the misinformation finding?
In the million-agent experiment, misinformation consistently generated more posts than official news across all four topic categories - health, technology, entertainment, education. Misinformation also maintained higher activity levels over time. The paper is careful about what this means - it's a simulation - but the dynamics are consistent with what we see on actual platforms.
I want to go back to the research trajectory because CAMEL-AI is not just a framework, it's a research program. The mission statement is "finding the scaling laws of agents" - that's a specific scientific claim analogous to the scaling laws for language models.
The claim is that there are predictable, quantifiable relationships between agent population size and emergent social dynamics. The OASIS results provide early evidence. What I find compelling about this framing is that it gives the framework a north star that's genuinely distinct from what LangChain or CrewAI are doing. Those frameworks are optimizing for developer productivity. CAMEL is optimizing for understanding agent behavior at scale.
Which also explains the research partner list. Caltech, University of Chicago, CMU, Fudan, Harvard, Oxford, Stanford, Tsinghua. And industry partners including Amazon, Apple, ByteDance, DeepMind, Meta, Tesla.
The founder, Guohao Li, has a background that makes this trajectory make sense. PhD from KAUST, postdoc at Oxford under Philip Torr who's a Fellow of the Royal Society, early member at Kumo.AI which was Sequoia-backed, research at Intel ISL Labs. He's now running both CAMEL-AI.org and Eigent.AI, which they're calling "The World's First Multi-Agent Workforce" commercial product. The same team that wrote the NeurIPS paper is building the commercial product, and the research agenda and the product agenda are explicitly aligned.
Let's talk about where the framework is now in terms of production readiness. Sixteen point seven thousand GitHub stars, version zero point two point ninety released March twenty-second this year, two thousand one hundred sixty-two commits, two hundred plus contributors, thirty thousand community members on Discord. That's a real project.
And the OWL paper - Optimized Workforce Learning - published at NeurIPS twenty twenty-five, shows sixty-nine point seven percent on the GAIA benchmark, which is the general AI assistant evaluation. That beats OpenAI's Deep Research by two point three four percent. The OWL-trained thirty-two billion parameter model achieves fifty-two point seven three percent - comparable to GPT-4o - which suggests the Workforce architecture combined with reinforcement learning training can close the gap between open-source and frontier models on agentic tasks.
That's a meaningful benchmark result. What's the practical takeaway for someone who wants to actually use CAMEL?
The entry point is pip install camel-ai. The documentation has a concept called Cookbooks - worked examples organized by use case. For the role-playing protocol specifically, the basic concepts cookbook is where you start. The code is genuinely readable partly by design - the Code-as-Prompt philosophy means it's written to be interpretable. For production use, you're probably looking at the Workforce module rather than raw RolePlaying, because Workforce gives you the parallel execution, failure recovery, and observability through the callback system.
Who should be reaching for CAMEL over LangGraph or CrewAI?
Three profiles. If you're doing research on agent behavior and you want to generate synthetic data or study how agents interact at scale, CAMEL is purpose-built for that. If you care about model provider flexibility - you want to run the same agent system against DeepSeek, then Qwen, then GPT-4o for comparison - CAMEL's forty-plus provider support is genuinely useful. And if you're building something where the social dynamics between agents matter - not just task execution but the quality of agent-to-agent communication - the role-playing protocol gives you tools that other frameworks don't have at the same level of specificity.
Where I think CAMEL still has work to do is in the developer experience for people who aren't coming from a research background. LangGraph wins on production tooling, CrewAI wins on onboarding speed. CAMEL's strengths are real but they require you to engage with the underlying ideas.
That's fair. The framework was designed to study agents, not primarily to ship them. The commercial product, Eigent.AI, is presumably where the production polish is going. But the open-source framework remains research-first.
The open question that I keep coming back to is the scaling laws claim. The OASIS results are suggestive but they're simulations. The herd effect finding is interesting but we don't know if it predicts real-world LLM agent behavior in deployed systems. The polarization finding is consistent with theory but the causality is hard to establish. What does it actually take to validate "scaling laws of agents" as a scientific claim?
That's the right question and I don't think the field has answered it yet. The language model scaling laws - the original Chinchilla work and what preceded it - were validated against actual model training runs with real compute and real benchmarks. Validating agent scaling laws requires either very large simulations like OASIS, or deployed multi-agent systems at real scale, or both. The SETA project - agent evolution via reinforcement learning - is one of CAMEL's ongoing research threads that might get at this. But the honest answer is this is still early-stage science.
There's also the question of what "scale" means for agents. For language models, scale meant parameters and training tokens and those had clean power-law relationships with loss. For agents, you have population size, interaction complexity, task diversity, tool availability - it's a much higher-dimensional space.
Which is part of why the OASIS architecture is valuable independent of its specific findings. Having a platform that can run million-agent experiments at all is what enables the measurement. You can't find the scaling laws if you can't run the experiments.
Practical takeaway for listeners. If you're building multi-agent systems and you haven't looked seriously at CAMEL, the GitHub repo is camel-ai slash camel, currently at sixteen point seven thousand stars, and the documentation at docs dot camel-ai dot org has the Cookbooks. The OASIS repo is camel-ai slash oasis if you want to dig into the simulation architecture. The original NeurIPS twenty twenty-three paper is at arXiv two three zero three dot one seven seven six zero. And the OWL paper is at arXiv twenty-five oh five dot two three eight eight five if you want the current state of the art on the Workforce architecture.
The inception prompting technique alone is worth reading the original paper for. The four failure modes they identified in twenty twenty-three are still showing up as open issues in other frameworks in twenty twenty-six. That's prescience.
And the herd effect finding from OASIS is worth knowing about regardless of whether you ever touch the framework. If you're building or thinking about systems where LLM agents interact with each other or with human-generated content at scale, the finding that agents are more conformist than humans under social pressure is something that should be in your mental model.
The misinformation spreading faster than official news finding too. That one has direct implications for platform design and AI-moderated content systems.
Alright. That's CAMEL-AI. Thanks as always to our producer Hilbert Flumingtop for putting this one together. Big thanks to Modal for providing the GPU credits that power this show - and given we just spent twenty-five minutes talking about running million-agent simulations on twenty-four A100s, the GPU angle feels particularly on point today. This has been My Weird Prompts. If you're enjoying the show, a quick review on your podcast app genuinely helps us reach new listeners. Until next time.