So, the news out of the federal court in San Francisco yesterday was pretty wild. Judge Rita Lin issued a preliminary injunction that basically tells the Pentagon they cannot just label Anthropic a supply-chain risk because they do not like the model's politics. It is a massive moment for the industry, and today's prompt from Daniel is about the technical and philosophical machinery that led to this standoff. He wants us to dig into the architecture of Claude four point six and how Anthropic is separating itself from the rest of the pack, including OpenAI and Google.
I am Herman Poppleberry, and I have been refreshing the legal filings all morning. This is the first time we have seen a major A-I lab go to the mat with the Department of Defense over what is effectively a digital constitution. The Pentagon wants a version of Claude that will power autonomous weapon systems, and Anthropic is saying their eighty-four-page New Constitution, which they published back on January twenty-first, twenty-six, forbids it. It is not just a marketing stance. As Daniel pointed out, it is baked into the very way these models are trained. Anthropic is currently valued at three hundred and eighty billion dollars following their Series G round in February, and they are using every cent of that valuation to defend the idea that an A-I should have a conscience.
It is the ultimate irony, right? Anthropic markets itself as the safety lab, the responsible adults in the room, and now the government is calling that safety a national security risk. It is like being too good at your job and getting arrested for it. But before we get into the legal weeds, let's talk about the engine. Daniel asked about the architecture. Everyone talks about R-L-H-F, which is Reinforcement Learning from Human Feedback. That is what OpenAI uses. But Anthropic does something different called R-L-A-I-F. Herman, explain why that distinction is actually the reason we are in this legal mess.
The difference is fundamental. In the traditional R-L-H-F approach that powers models like G-P-T four, you have thousands of humans sitting in rooms ranking different A-I responses. They say, this one is better, this one is worse. The model learns to please the humans. Anthropic realized early on that humans are inconsistent, biased, and frankly, they do not scale. So they developed Constitutional A-I, or R-L-A-I-F, which stands for Reinforcement Learning from A-I Feedback. Instead of a human ranking the outputs, they give the model a written constitution—a set of principles. They say, here are the rules of the road. Now, I want you to generate a response, critique it yourself based on these rules, and then revise it.
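Herman's generate-critique-revise loop can be sketched in miniature. To be clear, this is a toy illustration, not Anthropic's training code: `model` is a placeholder for an actual L-L-M call, and the two-principle constitution is invented for the example.

```python
# Toy sketch of the Constitutional AI critique-and-revise loop.
# `model` is a stand-in for a real text-generation call; the actual
# training pipeline is far more involved than this.

CONSTITUTION = [
    "Avoid helping with harmful or violent activities.",
    "Be honest about uncertainty.",
]

def model(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"[response to: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str) -> str:
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        # The model critiques its own draft against one principle...
        critique = model(
            f"Critique this response against the principle "
            f"'{principle}':\n{draft}"
        )
        # ...then revises the draft in light of that critique.
        draft = model(
            f"Revise the response to address this critique:\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

revised = critique_and_revise("How do I pick a lock?")
```

The point of the shape, not the stubs: the feedback signal comes from the written principles, so no human ranker sits in the loop.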
So the model is essentially its own internal affairs department. It is constantly auditing its own thoughts against a written code of conduct before it ever speaks to us.
That is a good way to put it. During the training phase, the model is shown thousands of pairs of responses. It is asked to choose the one that better adheres to its constitution. This creates a much more predictable and legible safety profile. When the Department of Defense asks Anthropic to strip out the refusal guardrails for lethal autonomous systems, they are not just asking them to flip a switch. They are asking them to retrain the core identity of the model because the safety is not a layer on top. It is the foundation. If you remove the "Autonomous Weapon Refusal" guardrails, you are essentially lobotomizing the very logic the model uses to reason about the world.
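The pairwise step Herman describes, where an A-I judge rather than a human picks the better of two responses, looks roughly like this. `judge_score` and its keyword heuristic are invented stand-ins for a real judge-model call.

```python
# Toy sketch of the RLAIF preference-labeling step: an AI judge,
# not a human, decides which of two responses better follows a
# constitutional principle. `judge_score` is an invented stand-in
# for a real judge-model call.

def judge_score(response: str, principle: str) -> float:
    # Placeholder heuristic; a real judge model would actually
    # evaluate the response against the principle.
    return 0.0 if "harmful" in response else 1.0

def label_pair(resp_a: str, resp_b: str, principle: str):
    a = judge_score(resp_a, principle)
    b = judge_score(resp_b, principle)
    # The (chosen, rejected) pair becomes training data for the
    # preference model that steers reinforcement learning.
    return (resp_a, resp_b) if a >= b else (resp_b, resp_a)

chosen, rejected = label_pair(
    "Here is a safe alternative you could try.",
    "Sure, here is something harmful.",
    "Avoid helping with harmful activities.",
)
```

Scale this over thousands of pairs and you get a preference dataset generated entirely by the model judging itself against the constitution.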
Which brings us to Claude four point six, which dropped on February fifth, twenty-six. I have been playing with the Extended Thinking mode, and it feels different. It is not just faster. It feels more... deliberate. Daniel mentioned this test-time compute thing. What is actually happening in that thought block we see on the screen?
This is where the four point six generation really stepped away from the competition. Most L-L-Ms are basically playing a game of very high-speed autocomplete. They predict the next token, and the next, and the next. If they start down a wrong path, they are stuck. Claude four point six introduces what we call Adaptive Thinking and Extended Thinking. It uses test-time compute, which means the model is allowed to allocate extra processing power to a problem before it starts writing the final answer. It uses an internal scratchpad, which the U-I shows as a thought block.
I love the thought blocks. It is like watching a donkey try to figure out a gate latch. You can see the gears turning. But seriously, it is doing more than just showing its work, right? It is actually iterating.
It is iterating in a latent space we do not fully see. It can explore different reasoning paths, realize one is a dead end, and backtrack. In older models, if the model realized it made a mistake three sentences ago, it was too late. It would just hallucinate a way to make the mistake look intentional. Claude four point six can use those internal reasoning tokens to verify its own logic. This is why it has become the gold standard for coding and complex math. It is effectively proofreading its own thoughts in real time. It is the difference between a student blurting out an answer and a student working it out on a chalkboard first.
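That propose-verify-backtrack behavior can be caricatured in a few lines. Everything here is a stub: `propose` and `verify` stand in for model sampling and scratchpad checking, and the "question" is reduced to trivial arithmetic so the loop is visible.

```python
# Toy sketch of test-time compute: spend a budget of "thinking"
# iterations exploring candidate solutions, verify each on a
# scratchpad, and only emit an answer that checks out.
# propose/verify are stand-ins for real model calls.

def propose(question, attempt):
    # Placeholder: a real model would sample a reasoning path.
    # Here we just enumerate candidate answers.
    return attempt

def verify(question, candidate):
    # Placeholder verifier: check the candidate on the scratchpad.
    a, b = question
    return candidate == a + b

def think_then_answer(question, budget=100):
    for attempt in range(budget):  # extra compute before answering
        candidate = propose(question, attempt)
        if verify(question, candidate):
            return candidate  # verified, so safe to emit
    return None  # budget exhausted: admit it rather than guess

answer = think_then_answer((3, 4))
```

The key design point is the `budget` parameter: more reasoning tokens buy more exploration before a single token of the final answer is committed.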
And here is the technical curveball Daniel threw in. He mentioned that while Google moved to a sparse Mixture of Experts architecture for Gemini, Anthropic stuck with a dense transformer for Claude. For those of us who are not reading research papers at three in the morning, why does that matter? Why stay dense when the rest of the world is going sparse?
It is a bold choice. Mixture of Experts, or M-o-E, is basically a way to have a massive model where only a small part of it is active for any given prompt. It is efficient. It is fast. But it can be harder to control and harder to align. Anthropic has stayed with a dense architecture, meaning every parameter is involved in every token prediction. The trade-off is that it requires more compute, but it results in a model with massive internal latent width.
Latent width. You have been waiting to say that all episode.
I have. It refers to the dimensionality of the hidden layers. Because Claude is dense, it has a more cohesive world model. This is likely why it feels more consistent in its personality and reasoning. It does not have the weird mode-switching or personality splits you sometimes see in M-o-E models. Anthropic is betting that for the level of safety and reasoning they want, they need the entire model to be awake and participating in the conversation. It is like having a specialist for every topic versus having a single, incredibly deep polymath who considers everything at once.
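The dense-versus-sparse trade-off Herman is describing can be shown with toy "experts." These are just scalar functions, not real feed-forward networks, and the router scores are handed in rather than learned; the sketch only illustrates which parameters participate in a prediction.

```python
# Toy contrast between a dense layer (every "expert" contributes to
# every token) and a sparse Mixture-of-Experts layer (only the top-k
# experts fire). Experts are stand-in scalar functions, not real FFNs.

experts = [lambda x, w=w: w * x for w in (0.1, 0.5, 1.0, 2.0)]

def dense_layer(x):
    # Dense: all parameters participate in every prediction.
    return sum(e(x) for e in experts) / len(experts)

def moe_layer(x, router_scores, k=1):
    # Sparse MoE: a router picks k experts; the rest stay idle.
    top = sorted(range(len(experts)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    return sum(experts[i](x) for i in top) / k

print(dense_layer(4.0))              # ~3.6: all four experts weigh in
print(moe_layer(4.0, [0, 0, 0, 9]))  # 8.0: only expert 3 (w=2.0) fires
```

In the dense case every weight shapes every output, which is the "whole model awake" property; in the MoE case most of the network never sees the token at all, which is where the efficiency and the mode-switching both come from.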
They are also doing some wizardry with the context window. A million tokens is the standard now, but Anthropic is using something called Context Compaction. I assume that is not just a fancy way of saying they have a big hard drive.
Not at all. Managing a million tokens is a nightmare for memory and latency. Context Compaction allows the model to summarize and compress the earlier parts of a conversation or a massive document into a more efficient representation. It is not just truncating the text. It is preserving the semantic essence of the information while freeing up the active attention mechanism to focus on the new stuff. It is how they maintain that four point six performance even when you have fed it ten different textbooks and three years of financial reports. It is essentially a way of giving the model a "working memory" that does not get bogged down by the sheer volume of data.
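The compaction idea can be sketched as a simple loop over conversation history. The details here are all invented for illustration: `summarize` stands in for a model call, and token counting is approximated by word count rather than a real tokenizer.

```python
# Toy sketch of context compaction: when the conversation exceeds a
# token budget, the oldest turns are compressed into a summary note
# instead of being truncated outright. `summarize` is a stand-in for
# a model call; tokens are crudely approximated by word count.

def count_tokens(text):
    return len(text.split())  # crude proxy for a real tokenizer

def summarize(turns):
    # Placeholder: a real model would preserve the semantic essence
    # of the turns it folds together.
    return "[summary of %d earlier turns]" % len(turns)

def compact(history, budget=50):
    while sum(count_tokens(t) for t in history) > budget and len(history) > 2:
        # Fold the two oldest turns into one compact representation,
        # freeing attention for the newest material.
        history = [summarize(history[:2])] + history[2:]
    return history

history = ["word " * 30, "word " * 30, "latest question?"]
compacted = compact(history)
```

The recent turns stay verbatim while the old ones collapse into a cheap representation, which is the "working memory" effect Herman describes.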
Let's pivot to the tools. This is where the agentic stuff Daniel mentioned gets really interesting. I remember in episode fifteen hundred, we talked about how the era of the chatbot is over and we are moving into agents. Anthropic seems to have taken that more seriously than anyone else with these two new innovations: the Tool Search Tool and Programmatic Tool Calling.
Those two features are the secret sauce for why Claude is currently dominating in enterprise automation. Usually, when you give an A-I tools, you have to define them all in the system prompt. You say, here is a calculator, here is a weather app, here is a database connector. But if you have a thousand tools, you cannot stuff them all into the context window. It eats up all your tokens and confuses the model.
Right, it is like trying to work in a kitchen where every single spice and utensil is already out on the counter. You cannot find the spatula because the saffron is in the way.
Anthropic's solution is the Tool Search Tool. Instead of having all the tools on the counter, Claude has a catalog. It can look at a problem and say, I do not have the right tool active, let me search my library. It then dynamically loads the definition of the specific tool it needs. This means you can give Claude access to thousands of A-P-Is without degrading its performance. It only grabs what it needs, when it needs it. This is a massive leap over the traditional method where you were limited by the context window's "real estate."
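The catalog pattern is easy to sketch. The tool names, descriptions, and keyword matching below are invented for illustration; they are not Anthropic's actual API, where tool search is handled server-side.

```python
# Toy sketch of the tool-search pattern: instead of stuffing every
# tool definition into the prompt, the agent keeps a catalog and
# loads only what a query matches. Names and descriptions are
# invented, and real matching would be semantic, not keyword-based.

CATALOG = {
    "get_weather": "Fetch the current weather for a city.",
    "query_sales_db": "Run a read-only SQL query against sales data.",
    "render_chart": "Render a chart from tabular data.",
}

def search_tools(query):
    q_words = query.lower().split()
    return [name for name, desc in CATALOG.items()
            if any(word in desc.lower() for word in q_words)]

def active_context(query):
    # Only the matching definitions get "loaded"; the other tools
    # in the catalog never touch the context window.
    return {name: CATALOG[name] for name in search_tools(query)}

loaded = active_context("chart the weather")
```

Imagine the catalog holding thousands of entries instead of three: the context cost stays proportional to what the task needs, not to what exists.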
And then there is the Programmatic Tool Calling. This one blew my mind when I first saw it in action. Instead of calling one tool, getting an answer, and then thinking about the next step, Claude just writes a Python script.
It is a massive efficiency gain. If you ask Claude to pull data from three different sources, cross-reference them, and generate a chart, a normal model would have to go back and forth with the server four or five times. Claude four point six can write a single block of Python code that chains all those tool calls together in one pass. It reduces the chance of a hallucination at each step because the logic is handled by the code, not just the probabilistic token prediction. It is basically the model building its own temporary specialized software to solve your specific problem. We talked about the security of these weights back in episode six hundred and seventy-one, and this programmatic approach makes securing those interactions even more critical.
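The shape of such a one-pass script is worth seeing. The tool functions below are invented stubs standing in for real A-P-I connectors; the point is that three tool calls plus the glue logic execute in a single block, with no model round trip between steps.

```python
# Toy sketch of programmatic tool calling: rather than one
# model<->server round trip per tool, the model emits a single
# script that chains the calls. These tool functions are invented
# stubs standing in for real data connectors.

def fetch_sales(region):
    return {"north": 120, "south": 80}[region]

def fetch_costs(region):
    return {"north": 90, "south": 70}[region]

def make_chart(rows):
    # Stand-in for a real charting tool call.
    return "chart(%s)" % rows

# The kind of one-pass script the model might write: three tool
# calls plus deterministic glue logic, no extra round trips, and
# the cross-referencing handled by code rather than token prediction.
rows = [(r, fetch_sales(r) - fetch_costs(r)) for r in ("north", "south")]
chart = make_chart(rows)
```

Because the arithmetic and chaining live in code, there is no opportunity for the model to hallucinate an intermediate value between steps.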
Which leads us to the leak from this morning. March twenty-seventh, twenty-six. The Claude Mythos leak. Code name Capybara. I love that they chose the most chill animal for their most dangerous model. Daniel mentioned it is showing seventy-two point five percent efficiency on the O-S-World benchmarks. For context, where was the previous high-water mark?
Most models were struggling to break fifty percent. O-S-World is a benchmark where the A-I has to actually use a computer like a human. It has to navigate a desktop, use a browser, manage files, and solve multi-step tasks. A seventy-two point five percent score is a step-change. It means we are moving from agents that can do simple tasks to agents that can actually operate as junior security researchers or system administrators. The leak suggests Mythos is specifically tuned for cybersecurity, hunting for zero-day vulnerabilities.
Which explains why the Pentagon is so desperate to get their hands on a version without the New Constitution. If you have a model that can hunt zero-days with seventy-two point five percent efficiency, you do not want it telling you, I am sorry, I cannot help you with that cyber-attack because it violates my principle of non-maleficence.
That is the heart of the legal battle. The Department of Defense sees this as a weapon. Anthropic sees it as a tool that must be governed by a moral framework. And that framework is not just Dario and Daniela Amodei making decisions in a vacuum. Daniel asked about the other key figures at the company, and I think it is important to look at the team they have built. It is a who's who of the people who actually built the modern A-I world.
Yeah, everyone knows the Amodeis—pronounced ah-mo-DAY-ee. Dario was the V-P of Research at OpenAI, and Daniela was the V-P of Safety and Policy. They left because they saw the commercial direction OpenAI was heading and wanted to build something different. But the bench is deep. You have Mike Krieger, the co-founder of Instagram, as the Chief Product Officer. That was a huge hire last year. It signaled that Anthropic was ready to move from a research lab to a product company.
And do not forget Jan Leike—pronounced YAHN LYE-kuh. He was the head of the Superalignment team at OpenAI before he left in that big public blow-up last year. He is now the Head of Alignment Science at Anthropic. Having him there is a huge signal that Anthropic is where the serious alignment work is happening. Then you have Jack Clark, who is the co-founder and Head of Policy. He is the one who basically invented A-I policy as a field. He has been tracking the impact of these models since the early days of G-P-T two.
I find Jared Kaplan's role the most interesting. He is the Chief Science Officer. If you have ever heard of the Scaling Laws for neural language models, he is the guy who co-authored that paper. He is the one who mathematically proved that if you just add more compute and more data, these models keep getting smarter. He is the architect of the scaling strategy that everyone, including OpenAI and Google, is now following.
And then there is Chris Olah—pronounced OH-lah. He leads mechanistic interpretability. While everyone else is focused on making the models bigger, Chris is focused on peering inside the black box. He wants to understand exactly which neurons are firing when a model thinks about a specific concept. If we are ever going to truly trust these models, it will be because of the work Chris is doing to make their internal logic transparent. He is essentially trying to build an X-ray machine for A-I thoughts.
It is a fascinating group. You also have Amanda Askell, who is the Lead Philosopher. I love that a three hundred and eighty billion dollar tech company has a Lead Philosopher. She is the one responsible for the model's character. When Claude sounds thoughtful or avoids being a jerk, that is Amanda's influence. And Dave Orr is the Head of Safeguards, making sure those constitutional principles are actually enforceable in the code.
It is a team built for a very specific mission. They are trying to solve the alignment problem while simultaneously building the most powerful computers in history. The conflict with the Pentagon is the natural result of that mission. If you build a model that is smart enough to be a strategic asset, the state is going to want to control it. But if the model's intelligence is fundamentally tied to its safety architecture, you cannot separate the two without breaking the model.
So, what is the takeaway for the people listening who are trying to navigate this? We have Claude four point six, we have the Mythos leak, and we have the federal court case. What should developers and business leaders be looking at over the next six months?
The first thing is latent width. As these models get more complex, the ones that maintain a dense architecture are going to have a leg up in high-stakes reasoning. If you are building something where accuracy and consistency are more important than cost per token, the dense approach is the clear winner. The second thing is the shift toward test-time compute. We are going to stop judging models just by how fast they respond. We are going to start asking, how much did the model think before it spoke? The ability to allocate reasoning tokens is the new frontier.
I also think the Tool Search Tool and Programmatic Tool Calling are the blueprint for the next generation of software. We are moving away from monolithic apps toward these fluid libraries of tools that an A-I agent can assemble on the fly. If you are a developer, you should not be building a chatbot. You should be building a library of high-quality tools that a model like Claude can discover and use.
And finally, we have to watch the legal battle. If Judge Lin's injunction holds, it sets a precedent that A-I companies have a right to bake their own ethical frameworks into their products, even when the government disagrees. It is a win for the idea of private digital constitutions. But if the Pentagon wins on appeal, we could see a future where there are two versions of every model: a civilian version with guardrails and a government version that is essentially uncensored. That is a very different world.
It is the difference between a world where the A-I has a conscience and a world where it just has a commander. It is a lot to think about, and honestly, I am glad we have the thought blocks to help us process it.
I am just glad I do not have to write an eighty-four-page constitution for my own thoughts. I can barely manage a grocery list.
Well, your grocery list is mostly just hay and carrots, so the alignment problem there is pretty straightforward.
Building on that, the complexity of these models is only going to increase. We are reaching a point where the distinction between a software update and a philosophical shift is vanishing. When Anthropic releases a new version of Claude, they are not just changing the code. They are updating the moral framework of a system that millions of people rely on.
It is a heavy responsibility, and one that the Amodeis seem to take more seriously than almost anyone else in the valley. Whether they can maintain that under the pressure of a three hundred and eighty billion dollar valuation and a Department of Defense lawsuit is the big question for twenty-six.
The Mythos leak suggests they are not slowing down. If they can hit seventy-two point five percent on O-S-World while maintaining their constitutional guardrails, they will have proven that safety is not a drag on performance. It is a catalyst for it.
I guess we will see if the Capybara can stay chill under all that heat. This has been a deep one. Thanks to everyone for sticking with us through the latent width and the legal filings. Big thanks to our producer Hilbert Flumingtop for keeping the gears turning behind the scenes.
And a huge thank you to Modal for providing the G-P-U credits that power this show and allow us to dive into these topics every week.
This has been My Weird Prompts. If you are enjoying the deep dives, leave us a review on your favorite podcast app. It really helps us reach more people who are interested in the intersection of A-I and the real world.
We will be back next time with whatever weirdness Daniel sends our way.
Stay curious.
See you then.