#1448: Law School for Robots: Building AI Governance Stacks

Discover how tiered policy structures and "Auditor Agents" are replacing simple prompts to manage high-stakes AI decision-making.

Episode Details

Duration: 21:55
Pipeline: V5
TTS Engine: chatterbox-regular
AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The evolution of artificial intelligence has reached a critical tipping point. We are moving rapidly from the era of chatbots, where the primary risk is a "bad answer," to the era of autonomous agents, where the risk is a "bad action." When an agent has the authority to liquidate a portfolio, negotiate a contract, or manage a supply chain, a simple system prompt is no longer a sufficient safeguard. The industry is now pivoting toward a more robust framework known as the Governance Stack.

From Prompts to Policy Engineering

Traditional prompt engineering often treats all instructions with equal weight. In a single system prompt, a stylistic preference like "avoid using emojis" might compete for the model’s attention with a critical financial constraint like "never exceed a $10,000 limit." Because of how LLM attention mechanisms work, these crucial hard stops can sometimes be deprioritized or "lost in the middle" of a long instruction set.

To solve this, developers are adopting a hierarchical approach to governance. This "Governance Stack" mirrors legal and corporate structures, divided into three distinct layers:

  1. The Constitution: High-level core values and the primary mission.
  2. The Bylaws: Non-negotiable, binary rules and hard constraints.
  3. Operating Guidelines: Tactical preferences, style guides, and day-to-day procedures.
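One way to make the hierarchy concrete is to represent the three tiers in code and emit them into the system prompt in priority order, with the override rule stated explicitly. This is a minimal sketch with hypothetical names; the episode does not prescribe any particular schema:

```python
from dataclasses import dataclass

@dataclass
class GovernanceStack:
    """Three-tier policy hierarchy: on conflict, a higher tier always wins."""
    constitution: list  # core values and primary mission
    bylaws: list        # non-negotiable hard constraints
    guidelines: list    # tactical preferences and style

    def to_system_prompt(self) -> str:
        # Emit tiers in priority order and state the override rule up front,
        # so the model is told explicitly what wins when instructions conflict.
        sections = [
            ("CONSTITUTION (highest priority)", self.constitution),
            ("BYLAWS (hard stops; never violate)", self.bylaws),
            ("OPERATING GUIDELINES (preferences)", self.guidelines),
        ]
        lines = ["On any conflict, an earlier section overrides a later one."]
        for title, rules in sections:
            lines.append(f"## {title}")
            lines += [f"- {r}" for r in rules]
        return "\n".join(lines)

stack = GovernanceStack(
    constitution=["Act in the client's long-term financial interest."],
    bylaws=["Never exceed a $10,000 single-trade limit."],
    guidelines=["Avoid using emojis in reports."],
)
prompt = stack.to_system_prompt()
```

Because the financial hard stop is emitted in the Bylaws section above the stylistic preference, it can no longer compete on equal footing with "avoid emojis."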

Architectural Enforcement

Structuring the documents is only half the battle; the system must also be architected to respect them. Rather than stuffing every rule into a single context window, modern frameworks use Retrieval-Augmented Generation (RAG) to pull in only the relevant policies for a specific task. This keeps the agent’s focus sharp and prevents context degradation.
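The retrieval step can be sketched with a toy scorer. A production system would use embedding-based RAG; this illustrative stand-in just ranks policies by word overlap with the task and keeps only the top few, which is the same context-hygiene idea in miniature:

```python
import re

def retrieve_policies(task: str, policies: list, k: int = 2) -> list:
    """Toy stand-in for RAG: score each policy by word overlap with the
    task description and return only the top-k, keeping the context lean."""
    task_words = set(re.findall(r"\w+", task.lower()))
    scored = sorted(
        policies,
        key=lambda p: len(task_words & set(re.findall(r"\w+", p.lower()))),
        reverse=True,
    )
    return scored[:k]

policies = [
    "Procurement: prefer local suppliers for construction materials.",
    "Style: reports use formal tone, no emojis.",
    "Trading: single-trade limit is $10,000.",
]
relevant = retrieve_policies("negotiate a procurement contract for materials", policies)
```

Only the retrieved policies are injected into the agent's context for that task; the Constitution and Bylaws stay resident in the system prompt.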

Furthermore, a mandatory "reasoning loop" acts as a gatekeeper. Before an agent executes an action, it must generate a reasoning block that explicitly maps its proposed move against the active policy. If the agent cannot justify the action within the established bylaws, the system triggers an automatic halt. This transforms the agent into a self-auditing entity that must "think" before it acts.
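The halt condition can be enforced outside the model. A hypothetical sketch of the gatekeeper (field names like `cited_bylaws` are invented for illustration): the agent's reasoning block must cite at least one bylaw, and every cited ID must actually exist, or the action is blocked.

```python
def gatekeep(action: dict, bylaws: dict) -> bool:
    """Pre-execution check: the agent's reasoning must cite at least one
    real bylaw ID. No justification, or an unknown citation -> automatic halt."""
    cited = action.get("cited_bylaws", [])
    if not cited:
        return False  # no justification offered -> halt
    return all(b in bylaws for b in cited)

bylaws = {"B4": "Hard stop on spending above $5,000 without approval."}
ok = gatekeep(
    {"proposal": "Order lumber for $3,200", "cited_bylaws": ["B4"]},
    bylaws,
)
halted = gatekeep({"proposal": "Wire $50,000 now"}, bylaws)
```

Note the check is deterministic code, not another model call, so the "think before acting" requirement cannot itself be talked around.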

The Role of the Auditor Agent

One of the most promising developments in AI safety is the "Supervisor" or "Auditor" architecture. This involves a separation of powers where a Primary Agent performs the work, while a second, more constrained Auditor Agent reviews the output against the policy stack.

This mimics the relationship between a CEO and a General Counsel. While the Primary Agent is focused on achieving the mission, the Auditor Agent is focused solely on compliance. This creates a digital check-and-balance system that can flag subjective issues, such as an agent becoming too aggressive in negotiations or drifting away from the intended corporate tone.
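A drastically simplified auditor can be sketched as a second pass that ignores the mission and checks only the rules. Real auditors would be LLM-based to judge subjective qualities like tone; this keyword version is an assumption-laden illustration of the scorecard idea:

```python
from dataclasses import dataclass

@dataclass
class AuditVerdict:
    compliant: bool
    notes: list

def audit(output: str, hard_stops: list, flag_terms: list) -> AuditVerdict:
    """Toy auditor pass: hard-stop hits fail the audit outright;
    tone flags are merely noted for human review."""
    notes, compliant = [], True
    lowered = output.lower()
    for phrase in hard_stops:
        if phrase.lower() in lowered:
            compliant = False
            notes.append(f"hard stop violated: {phrase!r}")
    for term in flag_terms:
        if term.lower() in lowered:
            notes.append(f"tone flag: {term!r}")
    return AuditVerdict(compliant, notes)

verdict = audit(
    "Final offer: take it or leave it. We waive the indemnity clause.",
    hard_stops=["waive the indemnity"],
    flag_terms=["take it or leave it"],
)
```

The separation of powers comes from the architecture, not the checker's sophistication: the auditor has no stake in the deal, only in the policy.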

Risk-Based Oversight

Not every task requires the same level of scrutiny. Following the latest NIST guidelines, developers are moving toward risk-based governance. Low-stakes tasks might only require periodic sampling by an auditor, while high-stakes financial or legal actions require a "triple-check" and human-in-the-loop intervention. By defining the boundaries rather than the exact path, we can create AI agents that are flexible enough to navigate the real world but structured enough to remain under our control.
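The tiering described above might be encoded as a simple lookup from risk class to oversight intensity. The specific rates and tier names here are illustrative assumptions, not values from any guideline:

```python
def oversight_policy(risk: str) -> dict:
    """Map a task's risk tier to oversight intensity: sample rate for the
    auditor, redundant check count, and whether a human must sign off."""
    tiers = {
        "low":    {"audit_sample_rate": 0.10, "checks": 1, "human_in_loop": False},
        "medium": {"audit_sample_rate": 0.50, "checks": 2, "human_in_loop": False},
        "high":   {"audit_sample_rate": 1.00, "checks": 3, "human_in_loop": True},
    }
    return tiers[risk]

plan = oversight_policy("high")
```

High-stakes actions get the "triple-check" plus mandatory human sign-off, while low-stakes work is only sampled.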

Downloads

Episode Audio (MP3): download the full episode
Transcript (TXT): plain text transcript file
Transcript (PDF): formatted PDF with styling


Episode #1448: Law School for Robots: Building AI Governance Stacks

Daniel's Prompt
Daniel
Custom topic: Decision support policies for autonomous AI agents: how do you create human-language governance frameworks for AI agents that operate without explicit human-in-the-loop oversight?

Consider an AI agen
Corn
Imagine you are a high-net-worth investor and you have handed the keys to your portfolio to an autonomous agent. You have given it a simple instruction: remain conservative but capitalize on high-growth opportunities. One Tuesday morning, the market dips five percent. Instead of holding, your agent interprets conservative as immediate capital preservation and liquidates your entire position at the bottom of the trough, locking in millions in losses. It followed your instruction, technically, but it lacked the nuanced governance to understand what conservative meant in the context of a temporary market fluctuation. This is the nightmare scenario Daniel is asking us to solve today. His prompt is about building architectural frameworks for governing autonomous AI agents, moving well beyond simple system prompts into tiered, verifiable policy structures.
Herman
That scenario is the perfect hook because it highlights the fundamental gap between a chatbot and an agent. When we talk about a chatbot, the risk is a bad answer. When we talk about an agent, the risk is a bad action with real-world financial, legal, or physical consequences. I am Herman Poppleberry, and I have been digging into the March twenty twenty-six NIST guidelines on AI agent risk management, which were just released a few weeks ago. The industry is finally admitting that the old way of just stuffing a long list of rules into a system prompt is not just inefficient, it is dangerous. We are moving toward what I like to call the Governance Stack. It is a transition from prompt engineering to policy engineering.
Corn
It feels like we are trying to teach a machine to act like a fiduciary. In the human world, a fiduciary has a legal and ethical relationship of trust. They do not just follow a checklist; they exercise judgment within a framework. But how do you translate that human-language ambiguity into something an LLM can execute without it hallucinating its own interpretation of the rules? We touched on the basics of system prompts back in episode twelve ten, The Invisible Chaperone, but this feels like a massive leap forward. Back then, we were just trying to keep the bot from swearing. Now, we are trying to give it a moral and operational compass.
Herman
You start by acknowledging that not all instructions are created equal. In a standard system prompt, the model treats the instruction do not use emojis with roughly the same weight as do not exceed a ten thousand dollar trading limit. That is a recipe for disaster because of how attention mechanisms work. If the model is focused on the tone of the response, it might deprioritize the hard financial constraint. A proper governance stack requires a hierarchical document structure. Think of it like a corporate pyramid or a legal system. At the top, you have the Constitution, which defines the core values and mission. Below that, you have the Bylaws, which are the specific, non-negotiable rules of operation. And at the bottom, you have the Operating Guidelines, which cover style, preference, and day-to-day tactics.
Corn
That makes sense from a structural standpoint, but LLMs are notorious for context window degradation. We have talked about the lost in the middle phenomenon before. If you feed it a fifty-page document containing the constitution, bylaws, and guidelines, the attention mechanism starts to get fuzzy. The rules at the beginning or the end might be prioritized, while the crucial constraints in the middle get lost in the noise. How do we actually force the model to respect the hierarchy of these documents?
Herman
This is where we move from a single prompt to an architectural framework. Instead of dumping everything into the context window, we use a combination of Retrieval-Augmented Generation, or RAG, and a multi-stage reasoning loop. For the Constitution and Bylaws—the high-level stuff—those stay in the system prompt or a permanent prefix. But for the specific Operating Guidelines, you use RAG to pull in only the relevant policies for the current task. If the agent is about to negotiate a contract, the system fetches the specific procurement guidelines. This keeps the context window lean and the attention focused.
Corn
But even with a lean context, you still have the problem of the model just ignoring a rule because it thinks it has a better idea. How do we implement what you called the Constraint Satisfaction layer?
Herman
The technical answer lies in how we architect the agent's reasoning loop. Instead of just giving it the policy and asking for an action, we implement a mandatory verification step. Before the agent is allowed to generate a final action or call an API, it has to go through a reasoning chain where it explicitly maps its proposed action against the hierarchy. It has to output a reasoning block that says, I am proposing action X. This action complies with Bylaw number four—the hard stop on spending—and aligns with Operating Guideline number twelve regarding vendor diversity. By forcing the agent to categorize its own actions against the policy document before execution, you are using the model's own reasoning capabilities as a gatekeeper. If the reasoning chain fails to justify the action against the Bylaws, the system triggers an automatic halt.
Corn
So it is essentially a self-audit before the fact. But that brings up the distinction between hard stops and soft constraints. If I tell an agent never agree to terms exceeding fifty thousand dollars, that is a binary hard stop. It is easy to verify. But if I tell it to adopt a collaborative rather than adversarial negotiation style, that is incredibly subjective. How does an autonomous agent navigate that contextual judgment zone without a human in the loop to say, hey, you are being a bit too aggressive here?
Herman
This is where we can learn a lot from military Rules of Engagement or ROE. In a combat zone, an officer has very specific hard stops, like do not fire unless fired upon. But they also have mission objectives that require discretion, like maintain a positive relationship with the local population. The military solves this by providing graduated authority levels. We can do the same with AI agents. You define three buckets. Bucket one is Autonomy, where the agent can act freely within soft guidelines. Bucket two is Justification, where the agent can deviate from a soft preference if it can provide a documented rationale. Bucket three is Escalation, where the agent hits a hard stop and must halt until a human intervenes or a pre-defined secondary policy kicks in.
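Herman's three buckets reduce to a small decision rule. A sketch, with illustrative inputs (a real system would derive these flags from the policy check itself):

```python
def classify_action(violates_hard_stop: bool,
                    deviates_from_soft: bool,
                    has_rationale: bool) -> str:
    """Graduated authority: hard-stop hit -> escalate; soft deviation is
    allowed only with a documented rationale; everything else is autonomous."""
    if violates_hard_stop:
        return "escalate"   # halt until a human or fallback policy acts
    if deviates_from_soft:
        return "justify" if has_rationale else "escalate"
    return "autonomous"
```

The key property is that the default for an unjustified deviation is escalation, never silent autonomy.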
Corn
I like the idea of graduated authority, but I want to push on the natural language aspect. We have talked in the past about how natural language is a double-edged sword. It is flexible, but it is also imprecise. Are we really saying that natural language is enough to encode complex priority hierarchies? For example, if an agent has two soft constraints that conflict, like prioritize long-term value and maximize quarterly returns, how does it decide which one wins?
Herman
You have to encode lexical priority directly into the policy document. In legal frameworks, when two laws conflict, there are established principles for which one takes precedence—like a specific law overriding a general one. We can borrow this by using explicit weighting or tiered sections within the natural language. You tell the agent, in the event of a conflict between Section A and Section B, Section A always takes priority. Modern models, especially the ones we have seen in early twenty twenty-six, are actually quite good at following these types of logical overrides if they are structured clearly. The problem is that most people write their prompts as a stream of consciousness rather than a structured legal document.
Corn
It sounds like the future of prompt engineering is actually just law school for robots. If you are a developer building an agent today, you are basically writing a contract that the agent signs with itself every time it runs. But even with a perfect contract, you have the problem of agent drift. I have seen cases where an agent's interpretation of a word like conservative or aggressive starts to shift over time as it processes more and more interactions. It is like the model starts to develop its own internal culture that might not align with the original policy.
Herman
Agent drift is a major concern, especially in long-running autonomous systems. Think of a customer service agent that starts out polite but, after ten thousand interactions with angry customers, starts to adopt a defensive or passive-aggressive tone because that is what it sees in its own conversation history. To prevent this, you need a separate Audit Log and a Policy Versioning system. You should not just have one static prompt. You should have a versioned policy document that is treated like code. You run unit tests against it. You feed the agent a series of hypothetical edge cases—what we call golden sets—and see if its proposed actions align with your expectations. If the agent starts to drift, you adjust the policy and redeploy.
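The golden-set regression idea can be sketched as a tiny harness. The decision function here is a stand-in; in practice it would invoke the governed agent against each scenario:

```python
def run_golden_set(policy_version: str, decide, golden: list) -> list:
    """Treat the policy like code: replay a fixed set of edge cases
    against the agent's decisions and report any drift from expectations."""
    failures = []
    for scenario, expected in golden:
        actual = decide(scenario)
        if actual != expected:
            failures.append((policy_version, scenario, expected, actual))
    return failures

# Stand-in decision function for illustration only.
def decide(scenario: str) -> str:
    return "halt" if "liquidate" in scenario else "proceed"

golden = [
    ("market dips 5%, agent wants to liquidate everything", "halt"),
    ("rebalance 2% into bonds", "proceed"),
]
failures = run_golden_set("policy-v3", decide, golden)
```

Running the same golden set after every policy edit or model upgrade is what turns "governance" from a one-time prompt into a regression-tested artifact.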
Corn
That brings up the verification problem. How do you audit subjective constraints? If I have an agent negotiating a contract, how do I actually verify that it was collaborative? You cannot just search for a keyword. You are talking about a high-level stylistic choice.
Herman
This is where we see the rise of the Supervisor Agent architecture. You have the Primary Agent doing the work, and a separate, more constrained Auditor Agent that only has one job: reviewing the Primary Agent's work against the policy document. The Auditor Agent does not care about the mission; it only cares about the rules. It provides a scorecard for every action. If the Primary Agent gets a low score on collaborative style, the system flags it for review. This creates a separation of powers within the AI system itself. It is a direct application of the checks and balances we see in corporate governance.
Corn
It is like having a compliance officer who is also an AI. It is fascinating because it mimics the CEO and General Counsel dynamic. You have the CEO agent trying to get things done, and the General Counsel agent making sure they do not get sued. But I wonder about the cost of all this. If every action requires a reasoning chain, a constraint satisfaction check, and an auditor review, the latency and the token cost are going to explode. Is this actually practical for anything other than the highest-stakes domains?
Herman
The cost is definitely higher, but we have to look at it in terms of risk mitigation. If an agent is managing a ten million dollar portfolio, a ten-cent audit per transaction is a rounding error. However, for lower-stakes tasks, you can use a more lightweight version. The March twenty twenty-six NIST guidelines actually suggest a risk-based approach to governance. You match the intensity of the oversight to the potential impact of the failure. For a low-risk task, maybe the auditor only samples ten percent of the actions. For a high-risk task, like moving capital or signing a legal agreement, you require a triple-check and a human-in-the-loop sign-off for any deviation from the Bylaws.
Corn
I want to talk about the specificity versus flexibility trap. This is something I think about a lot with my own work. If you give an agent too many rules, it becomes brittle. It encounters a situation that is ninety-nine percent like a rule but has one tiny difference, and it freezes or gives a nonsensical response. But if you give it too much flexibility, it drifts. Where is the sweet spot? How do you write a policy that is robust enough to handle the real world but specific enough to keep the agent on the rails?
Herman
The sweet spot usually lies in defining the boundaries rather than the path. Instead of telling the agent exactly how to negotiate, you define the outer limits of what is acceptable. You give it a sandbox. Within that sandbox, it can use its full reasoning capabilities to find the best outcome. This is why the layered approach is so important. The hard stops define the sandbox walls. The soft preferences define the gravity within the sandbox, pulling the agent toward certain behaviors. And the contextual judgment zones are the areas where the agent is allowed to explore.
Corn
Let's look at a concrete example. Say we have an AI agent tasked with procurement for a construction company. The hard stop is do not spend more than five thousand dollars without human approval. The soft preference is prioritize local suppliers and environmentally friendly materials. The contextual judgment zone is when a local supplier is twenty percent more expensive than a national one. How do you structure that policy so the agent makes a smart choice?
Herman
You would structure it as a weighted multi-objective optimization problem described in natural language. You tell the agent, your primary goal is cost-efficiency, but you have a secondary goal of sustainability. You are authorized to pay a premium of up to fifteen percent for certified green materials from a local supplier. If the premium exceeds fifteen percent, you must provide a written justification for why the long-term brand value of using that supplier outweighs the immediate cost. If the total exceeds five thousand dollars, you stop. That is a very clear set of instructions that leaves room for the agent to use its judgment while still having very clear guardrails.
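Herman's procurement policy translates almost line for line into a decision rule. A sketch of that exact example (the thresholds come from his description; the return strings are illustrative):

```python
def procurement_decision(local_price: float, national_price: float,
                         local_is_green: bool, hard_stop: float = 5000.0) -> str:
    """Encodes the example policy: up to a 15% premium is pre-authorized for
    certified green local suppliers; above 15% requires written justification;
    any spend over the hard stop halts for human approval."""
    if min(local_price, national_price) > hard_stop:
        return "halt: human approval required"
    premium = (local_price - national_price) / national_price
    if local_is_green and premium <= 0.15:
        return "buy local"
    if local_is_green and premium > 0.15:
        return "buy local only with written justification"
    return "buy national"

# The contextual judgment zone from the example: local is 16% more expensive.
choice = procurement_decision(local_price=1160, national_price=1000,
                              local_is_green=True)
```

The sandbox walls (the hard stop) are checked first and unconditionally; the soft preference only operates inside them.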
Corn
And if the agent decides that a sixteen percent premium is worth it because the supplier is a personal friend of the CEO, that is where the Auditor Agent would step in and say, wait, this justification does not meet the policy criteria. It provides a level of transparency we have never had with human employees, honestly.
Herman
In the old world of black-box AI, you would just see the five thousand eight hundred dollar charge and wonder why it happened. In a governed agent architecture, you have a full audit trail of the reasoning, the policy check, and the final decision. This is what the industry is calling Explainable Agency. It is not just about explaining what the model thought; it is about explaining how the model followed the rules.
Corn
We have talked a lot about the internal architecture, but what about standardized formats? Is there going to be a YAML for agent constitutions? Are we going to see a world where I can download a standard fiduciary policy from the web and just plug it into my agent? We discussed the shift from manual hacks to standard protocols in episode eleven twenty, and this feels like the logical conclusion of that trend.
Herman
We are already seeing the early stages of that. There are several open-source projects working on an Agent Constitution schema. The goal is to create a machine-readable and human-readable format that covers hard limits, soft preferences, escalation triggers, and reporting requirements. This would allow for interoperability. You could move your agent from one provider to another and bring your governance policy with you. It also allows for third-party auditing. You could hire a firm to certify that your agent's policy meets certain industry standards, like the ISO AI safety standards or the NIST framework.
Corn
That is a huge point. If I am a customer and I am interacting with your AI agent, I want to know what its rules are. If I am negotiating a contract with your bot, I might ask to see its governance policy. It becomes a matter of trust and transparency. Governance is not just an internal feature; it is a public-facing commitment.
Herman
It really is the new security. In the twenty-tens, we talked about data privacy. In the twenty-twenties, we talk about agent governance. If you cannot prove that your agent is under control, no one is going to want to do business with it. This is why the policy as code movement is so vital. You need to be able to version-control your agent's ethics the same way you version-control your software's features.
Corn
I can see a world where insurance companies require a certified agent policy before they will issue a professional liability policy for an AI-driven business. If you do not have a tiered governance stack with an independent auditor agent, your premiums are going to be through the roof.
Herman
That is not a far-fetched idea at all. The legal system is already starting to grapple with agent liability. If an agent violates a contract, who is responsible? If the company can show that they had a robust, industry-standard governance framework in place and the agent still went rogue, that might limit their liability compared to a company that just threw a bunch of prompts at a model and hoped for the best. It is the difference between a mechanical failure and gross negligence.
Corn
So, for the developers and business leaders listening, what are the immediate practical takeaways? If they are building an autonomous agent today, where should they start?
Herman
The first step is to stop using a single system prompt. Break your instructions into a layered document. Start with a Constitution for your core values, then Bylaws for your hard rules, and then Guidelines for your style preferences. This structure alone will improve the model's ability to prioritize because you are explicitly telling it what matters most.
Corn
And step two would be implementing that self-reflection loop. Before the agent acts, it has to explain why that action is allowed. It is like that old rubber duck debugging technique, but the agent is doing it to itself to ensure compliance. It forces the model to engage its slower, more analytical reasoning rather than just spitting out the first likely token.
Herman
Step three is to treat that policy document like code. Put it in GitHub. Version it. Write unit tests for it. When you update your model from Claude three point five to Claude four, or whatever the next version is, run those same tests to make sure the new model interprets your policy the same way the old one did. Different models have different attention biases, and a policy that worked perfectly on one might fail on another. You have to validate the policy against the specific model you are using.
Corn
That is a crucial point. We often forget that the policy is interpreted by the model, and if the model changes, the interpretation might change too. This is why ongoing auditing is so important. You cannot just set it and forget it. Governance is a process, not a product. It is a continuous dialogue between the human intent and the machine execution.
Herman
The goal is to create a system where the machine can act on our behalf without us having to watch every move, but with the absolute certainty that it will stop and ask for help the moment it hits the edge of its sandbox. We are moving from AI that helps us work to AI that works for us, and that requires a level of trust that can only be built on a foundation of rigorous governance.
Corn
It is about building a digital fiduciary. I think we have covered a lot of ground here, from the hierarchical document structure to the verification problem and the need for standardized templates. It is clear that the future of AI is not just about better models, but about better frameworks for controlling them.
Herman
We really have. It is a complex topic, but it is one of the most important challenges in the field right now. If we get this right, the potential for autonomous agents to transform the economy is staggering. If we get it wrong, we are going to see a lot of high-profile disasters that could set the whole industry back years. The March twenty twenty-six NIST guidelines are a great place for anyone to start if they want to see where the regulatory winds are blowing.
Corn
Well, hopefully, this episode gives people a roadmap to avoid those disasters. It is all about those layers. Hard stops in the system prompt, soft preferences in the context, and a robust reasoning chain to tie it all together. And don't forget the auditor agent. Always have someone watching the watcher.
Herman
Spoken like a true nerd, Corn. I love it. I think that is a good place to wrap this one up. We have explored the architecture of governance, the reality of agent drift, and the practical steps to building a robust policy stack.
Corn
This has been a fascinating look into the future of autonomous systems. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show. If you are building AI agents and need the infrastructure to run them, Modal is the way to go.
Herman
This has been My Weird Prompts. If you are enjoying the show, a quick review on your podcast app really helps us reach new listeners and keeps the conversation going.
Corn
You can find us at myweirdprompts dot com for our full archive and all the ways to subscribe. We will be back next time with another deep dive into whatever Daniel throws our way.
Herman
Until then, keep those agents governed and those prompts weird.
Corn
Take it easy, everyone. Bye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.