#2182: Can You Actually Review an AI Agent's Plan?

Most AI agents have plans the way you have a plan while half-asleep—something's happening, but you can't see it. We map the five major planning patterns and ask whether a human can actually review an agent's plan before it executes.

Episode Details

Episode ID: MWP-2340
Duration: 25:39
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: claude-sonnet-4-6

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.


The Plan Problem: Why Most AI Agents Hide Their Thinking

When we talk about AI agents "planning," we're usually talking about something invisible. The agent has a strategy, sure—but you can't see it, touch it, or stop it before execution. That gap between "the agent has a plan" and "you can review the plan" is the central tension in agent design today.

Five Planning Patterns

ReAct: The Workhorse

ReAct (Reasoning + Acting) is the baseline. Presented at ICLR 2023 by researchers at Princeton and Google Brain, it works through a simple loop: the agent thinks about what to do, takes an action, observes the result, and repeats. The "plan" in ReAct is just the accumulating scratchpad—a running trace of thoughts and observations in the context window.

The advantage is real: grounding reasoning in actual tool outputs reduces hallucinations compared to pure chain-of-thought. The problem is timing. By the time you see what the agent is doing, it's already doing it. The plan is ephemeral, existing only as you read it—there's no discrete moment to pause and review.
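The loop above can be sketched in a few lines. This is a minimal plain-Python illustration with a stubbed model and a stubbed tool standing in for an LLM and real search; the point is that the "plan" is nothing but the growing scratchpad string, and each action executes the moment it is chosen.

```python
# Minimal ReAct-style loop. The "plan" is only the accumulating
# scratchpad -- there is no discrete artifact to pause on for review.

def lookup(query):
    # Stub tool: a canned lookup standing in for real search.
    return {"capital of France": "Paris"}.get(query, "unknown")

def stub_model(scratchpad):
    # Stand-in for an LLM: picks the next step from the trace so far.
    if "Observation:" not in scratchpad:
        return "Action: lookup[capital of France]"
    return "Final Answer: Paris"

def react(question, max_steps=5):
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = stub_model(scratchpad)
        scratchpad += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer: ").strip(), scratchpad
        # The action runs immediately -- no review gate exists here.
        query = step.split("[", 1)[1].rstrip("]")
        scratchpad += f"Observation: {lookup(query)}\n"
    return None, scratchpad

answer, trace = react("What is the capital of France?")
```

By the time you could read the trace, the tool call has already happened — which is exactly the timing problem described above.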

Plan-and-Execute: Separating Thinking from Doing

Plan-and-execute addresses this directly. A planner LLM generates a numbered list of steps before any action happens. Then an executor agent works through each step. The performance difference is measurable: on complex tasks, GPT-4 hits around 92% task completion with plan-and-execute versus 85% with ReAct. The tradeoff is cost—more tokens, more API calls.

Crucially, the plan exists as text. It could theoretically be intercepted and reviewed. Most implementations don't build that gate in, though. The plan sits there, but nobody's asking a human whether it looks right.
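Structurally, the missing gate is a one-parameter change. In this hedged sketch (stubbed planner and executor, not any particular framework's API), the plan exists as a list before execution, and an optional `review` callback is the interception point most implementations leave out.

```python
# Plan-and-execute sketch. The plan is a discrete artifact before
# anything runs -- the natural place for a (usually missing) review gate.

def stub_planner(task):
    # Stand-in for the planner LLM: numbered steps as plain text.
    return ["1. fetch the data", "2. summarize it", "3. compile the report"]

def stub_executor(step):
    # Stand-in for the executor agent.
    return f"done: {step}"

def plan_and_execute(task, review=None):
    plan = stub_planner(task)
    if review is not None and not review(plan):  # the optional human gate
        return plan, []
    return plan, [stub_executor(s) for s in plan]

# Without a reviewer, the plan is produced and immediately executed.
plan, results = plan_and_execute("write a report")

# With a reviewer that rejects, nothing executes.
plan2, results2 = plan_and_execute("write a report", review=lambda p: False)
```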

ReWOO: Programs Instead of Loops

ReWOO generates the entire tool call sequence upfront, using placeholder variables for future results. Think of it as a program: use Google to find X, assign to E1, use an LLM to extract from E1, assign to E2, find E2's hometown. Then a worker module executes each step and a solver synthesizes the answer.

The efficiency gain is real—you avoid ReAct's repeated prompt overhead where the full dialogue history gets sent to the LLM at every step. But like ReAct, the plan is typically generated and executed immediately. It's not persisted or made reviewable. It's a program that runs once and disappears.
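The "program with variables" idea can be made concrete. This is an illustrative sketch, not the ReWOO paper's implementation: the plan is a list of (variable, tool, templated input) triples generated up front, the example facts are canned stubs, and a worker substitutes earlier results into later steps.

```python
# ReWOO-style sketch: the whole tool sequence is planned in one pass,
# with placeholder variables (#E1, #E2) standing in for future results.

def search(q):
    # Stub tool with canned facts, standing in for web search.
    return {"2023 Australian Open winner": "Novak Djokovic",
            "hometown of Novak Djokovic": "Belgrade"}.get(q, "unknown")

# The full plan exists before any tool runs: (variable, tool, input).
plan = [
    ("#E1", search, "2023 Australian Open winner"),
    ("#E2", search, "hometown of #E1"),
]

def worker(plan):
    env = {}
    for var, tool, arg in plan:
        for name, value in env.items():  # substitute earlier results
            arg = arg.replace(name, value)
        env[var] = tool(arg)
    return env

env = worker(plan)  # a solver module would synthesize the final answer
```

Note that `plan` here is exactly the kind of object that could be shown to a human before `worker` runs — but in typical ReWOO implementations, it never is.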

Tree-of-Thought: Exploring Multiple Paths

Tree-of-thought is what you use when the problem has a search space—puzzles, optimization problems, creative tasks where a single reasoning chain might hit a dead end. Instead of one chain, the agent explores multiple candidate next steps, scores each, prunes the weak ones, and builds deeper from survivors.

The problem is intuition. A tree of reasoning paths isn't something you hand to a human and ask "does this look right?" It's expensive too—multiple LLM calls per step. A 2024 extension called LATS combines tree-of-thought with Monte Carlo Tree Search and brings us to the fifth pattern.
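The expand-score-prune loop is essentially beam search. This toy sketch (a made-up task: pick digits that sum to a target, standing in for real reasoning steps) shows why the resulting "plan" is hard to hand to a reviewer: it is a frontier of partial paths, not a list of steps.

```python
# Tree-of-thought as beam search: expand candidate next steps,
# score them, keep the best few, repeat. Toy scoring, toy task.

TARGET = 10

def expand(path):
    return [path + [d] for d in range(1, 6)]  # candidate next steps

def score(path):
    return -abs(TARGET - sum(path))           # closer to target is better

def tree_search(depth=3, beam=2):
    frontier = [[]]
    for _ in range(depth):
        candidates = [c for p in frontier for c in expand(p)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]          # prune the weak branches
    return max(frontier, key=score)

best = tree_search()
```

Every level multiplies the number of scored candidates — which is where the cost of multiple LLM calls per step comes from in the real pattern.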

Reflexion: Self-Critique and Memory

Reflexion, from NeurIPS 2023, adds a self-critique loop. An actor generates actions, an evaluator scores the output, and a self-reflection module writes natural language analysis of what went wrong. On the next attempt, the actor gets its previous reflections as context. Performance improvements on coding benchmarks are 10-20 percentage points over baseline.

This is the first pattern where the plan gets revised based on failure. But here's the catch: Reflexion is entirely internal. The reflections aren't exposed for human review. The agent runs its own retrospectives, and you're not invited.
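The actor/evaluator/reflection loop can be sketched with stubs. The toy actor here only "learns" because the reflection text changes its behavior on the retry — and notice that the `reflections` list lives entirely inside the function, with no hook for a human to read or edit it, which is the limitation described above.

```python
# Reflexion sketch: actor, evaluator, and a verbal memory of failures
# fed back into the next attempt. Entirely internal -- no human hook.

def actor(task, reflections):
    # Stub actor: fails until a reflection mentions the fix.
    if any("use sorted input" in r for r in reflections):
        return sorted(task)
    return task

def evaluator(output):
    return output == sorted(output)  # did this attempt pass?

def reflect(output):
    # Stub self-reflection module: a natural-language postmortem.
    return "attempt failed: use sorted input next time"

def reflexion(task, max_trials=3):
    reflections = []
    for trial in range(1, max_trials + 1):
        output = actor(task, reflections)
        if evaluator(output):
            return output, trial, reflections
        reflections.append(reflect(output))
    return None, max_trials, reflections

result, trials, memory = reflexion([3, 1, 2])
```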

The Self-Critique Problem

There's a sharp limitation to self-critique that an EMNLP 2025 paper makes clear: LLMs generate plausible but incorrect content with high internal self-consistency. A model can confidently defend a wrong answer through multiple reflection rounds. The property that makes reasoning seem coherent is the same property that prevents the model from catching errors it's confident about. You can run Reflexion ten times and get ten iterations of the same confident mistake.

Self-critique catches certain error classes and is blind to others. It's not a substitute for human review.

Making Plans Visible: The Infrastructure Layer

This is where the practical frameworks diverge sharply.

LangGraph: Plans as Typed State

LangGraph treats agent workflows as state machines—graphs where nodes are processing steps and edges define control flow. State is a typed data structure (TypedDict or Pydantic model) carrying all the information the agent needs. Every step reads from and writes to a checkpoint.

You can literally define a state class with a plan field, an approved field, and an execution results field. The plan gets written by the planner node, the approved flag gets set by human review, and the executor only runs when approved is true.

The interrupt mechanism, announced December 2024, makes this concrete. You compile the graph with interrupt_after=["planner"] to pause execution right after the planner node runs, before execution touches anything. At that point you call get_state, see the plan, and decide. If you approve, you call invoke with None and execution resumes. If you want to edit, you call update_state with a revised plan. The thread ID system means you can pause for hours or days—state persists to a database.
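LangGraph's real API (StateGraph, checkpointers, interrupt_after, update_state) requires the library; this plain-Python sketch with hypothetical names shows only the shape of the pattern — checkpoint after planning, pause, let a human edit or approve the saved state, then resume execution from it.

```python
# Shape of the LangGraph gate pattern, in plain Python with
# hypothetical names: persist after planning, pause, edit, resume.

checkpoints = {}  # thread_id -> saved state (a database in production)

def planner(task):
    # Stand-in for the planner node writing typed state.
    return {"task": task, "plan": ["step 1", "step 2"], "approved": False}

def run_until_interrupt(thread_id, task):
    state = planner(task)
    checkpoints[thread_id] = state  # checkpoint, then pause here
    return state

def update_state(thread_id, **changes):
    # The human review step: edit the plan, set the approved flag.
    checkpoints[thread_id].update(changes)

def resume(thread_id):
    state = checkpoints[thread_id]
    if not state["approved"]:
        raise RuntimeError("plan not approved")
    return [f"executed {s}" for s in state["plan"]]

# Pause after planning; the review can happen hours or days later.
state = run_until_interrupt("t1", "book a trip")
update_state("t1", plan=["step 1 (revised)"], approved=True)
results = resume("t1")
```

The thread ID is what decouples the review from the run: as long as the checkpoint persists, the human can come back whenever they like.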

A production travel booking agent built with LangGraph and Streamlit shows this in action. The agent generates a TravelPlan as a Pydantic model with fields for origin, destination, dates, budget, preferences. The graph pauses at a wait-for-approval node. The human sees the plan as editable JSON in the Streamlit UI, can edit it, approve or reject. Only after explicit approval does the execution node run.
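The writeup's TravelPlan is a Pydantic model; this stdlib-dataclass sketch (field names approximated from the description above) shows the same idea — the plan is typed, serializable data that a reviewer can round-trip through editable JSON before approval.

```python
# A typed plan artifact, sketched with a stdlib dataclass to stay
# self-contained (the original uses Pydantic; fields approximated).
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TravelPlan:
    origin: str
    destination: str
    dates: str
    budget: float
    preferences: list = field(default_factory=list)

plan = TravelPlan("SFO", "Tokyo", "2026-05-01 to 2026-05-10",
                  3000.0, ["window seat"])

# Round-trip through JSON -- the form a reviewer edits in the UI.
edited = json.loads(json.dumps(asdict(plan)))
edited["budget"] = 2500.0
revised = TravelPlan(**edited)
```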

The gap: LangGraph gives you the plumbing but no built-in review UI. You're building the Streamlit app yourself.

AutoGen: Plans Emerge from Dialogue

AutoGen, from Microsoft, takes a completely different approach. Plans aren't explicit data structures—they emerge from dialogue between agents. An AssistantAgent writes code and plans in natural language. A UserProxyAgent can execute code and request human input. Human input modes are ALWAYS, NEVER, or TERMINATE, controlling when the proxy asks for human input.

To review the plan, you read the conversation and figure out what the agent said it was going to do. Not a terrible workflow for a developer already monitoring the conversation, but not a structured approval gate either. The plan is implicit in dialogue, not a first-class object.

Claude Code's Plan Mode: Read-Only Reasoning

Claude Code's plan mode is elegant in its constraints. When in plan mode, Claude can read files, glob for patterns, grep file contents, search the web—but cannot write files, edit files, run shell commands, or execute anything. It's pure read-only reasoning.

The human sees Claude's analysis and proposed approach before any mutation happens. It's the most conservative design, but also the most human-readable. The plan is just Claude's explanation of what it would do.

The Open Question

The infrastructure for human-in-the-loop agent planning exists. LangGraph's checkpointing and interrupt system is genuinely solid. The question is adoption and UX. Most teams deploying agents today aren't building review gates. The plan-and-execute pattern is faster and cheaper than ReAct, but the performance gain comes partly from the explicit planning step—and that step is usually invisible.

The vision of treating an agent's plan like a pull request—review it, comment on it, approve or reject it, then let it execute—is technically achievable. Whether it becomes standard practice is a different question.


#2182: Can You Actually Review an AI Agent's Plan?

Corn
Here's what Daniel sent us this week. He wants to dig into agent planning strategies — specifically, how AI agents decide what to do before they do it, and whether humans can actually see and review those plans. He's asking about the major planning patterns: ReAct, plan-and-execute, tree-of-thought, reflexion. But the deeper question is practical: how do agents represent plans as state, do those plans get saved somewhere a human can actually touch, and can we get to a world where you treat an agent's plan like a pull request — review it, comment on it, approve or reject it, then let the agent execute? Lot to unpack here.
Herman
Herman Poppleberry, by the way. And yeah, this one gets at something I think is genuinely underappreciated — the gap between an agent that has a plan and an agent whose plan you can actually see and edit before it starts breaking things.
Corn
Because those are very different things.
Herman
Completely different. And the framing I keep coming back to is this: most agents today have plans the way you have a plan when you're half-asleep. There's something going on in there, but you couldn't write it down if someone asked you to.
Corn
That's a disturbing image. Let's start with the patterns themselves, because I think the academic framing matters here for understanding why the practical implementations look the way they do. ReAct is the baseline, right?
Herman
ReAct is the workhorse. Yao et al. out of Princeton and Google Brain, presented at ICLR 2023. The core loop is alternating thought, action, observation — the agent thinks about what to do, takes an action, observes the result, and repeats. The prompt template is almost comically explicit: "Thought: you should always think about what to do. Action: the action to take. Action Input: the input to the action. Observation: the result of the action." And you repeat that until you get a final answer.
Corn
So the plan is just... the scratchpad. The running trace of thoughts and observations.
Herman
That's the key insight. The plan in ReAct is not a discrete artifact. It lives in the context window as an accumulating sequence of reasoning steps. It reduces hallucinations compared to chain-of-thought alone because the agent is grounding its reasoning in real tool outputs — but the plan itself is ephemeral. By the time you see what the agent is doing, it's already doing it.
Corn
Which is fine if you trust the agent completely and the task is low stakes. Less fine if you're deploying this against a production database.
Herman
Right. And that's why plan-and-execute exists. The idea is to separate the planning phase from the execution phase. A planner LLM generates a numbered list of steps before any action is taken — step one: find this, step two: do that, step three: compile. Then an executor agent works through each step. The key advantage is that the plan exists as a discrete artifact before execution begins. On complex tasks with GPT-4, plan-and-execute hits around ninety-two percent task completion accuracy versus eighty-five percent for ReAct. The tradeoff is cost — you're spending more tokens and making more API calls.
Corn
So the plan-and-execute approach is better at the task, but costs more. And the plan actually exists as text you could theoretically intercept.
Herman
Theoretically. Most implementations don't build that interception gate in. The plan exists, but nobody's stopping to ask a human whether it looks right.
Corn
What about ReWOO? Because that one's interesting — it generates the entire tool call sequence upfront, right?
Herman
ReWOO is clever. Instead of the alternating thought-action loop, the LLM generates the entire sequence of tool calls in one pass, using placeholder variables for future results. So you'd get: use Google to find the Australian Open winner, assign that to E1, use an LLM call to extract the name from E1, assign that to E2, find E2's hometown, and so on. Then a worker module executes each step, and a solver synthesizes the final answer.
Corn
The plan is almost like a program. It has variables and dependencies.
Herman
The efficiency gain is real — you avoid the repeated prompt overhead of ReAct, where you're sending the entire dialogue history to the LLM at every step. With ReWOO, the LLM is called once for planning, once per tool execution, once for synthesis. But like ReAct, the plan is still typically generated and executed immediately. It's not persisted, it's not made reviewable. It's a program that runs once and disappears.
Corn
And then there's tree-of-thought, which is the expensive one.
Herman
Tree-of-thought is what you reach for when the problem has a search space — puzzles, optimization, creative tasks where a single reasoning chain might walk straight into a dead end. Instead of one chain of reasoning, the agent explores multiple candidate next steps at each point, scores each candidate, prunes the weak ones, and builds the tree deeper from the survivors. The LangGraph tutorial uses the twenty-four game as the canonical example — make the number twenty-four from four given numbers using basic arithmetic. A single chain of thought will often fail. A tree search finds the solution.
Corn
The plan in tree-of-thought is a tree of reasoning paths. Which is not exactly something you can hand to a human and say "does this look right to you?"
Herman
Not in any intuitive way. And it's expensive — multiple LLM calls per step. There's a 2024 extension called LATS, presented at ICML, that combines tree-of-thought with Monte Carlo Tree Search and something called Reflexion, which brings us to the fifth pattern.
Corn
Reflexion is the self-critique one.
Herman
Reflexion from Shinn et al. at NeurIPS 2023. Three components: an actor that generates actions, an evaluator that scores the output, and a self-reflection module that generates natural language analysis of what went wrong. On subsequent attempts, the actor gets its previous reflections as context. The GitHub repo has about three thousand stars. Performance improvements on coding benchmarks are in the range of ten to twenty percentage points over baseline.
Corn
So the agent is basically writing a postmortem after each failure and using that as context for the next try.
Herman
Which is actually the first pattern where the plan gets revised based on failure. The verbal memory buffer carries forward what didn't work. But here's the thing — and this is important — Reflexion is still entirely internal. The reflections are not exposed for human review. The agent is running its own retrospectives, and you're not invited.
Corn
And there's a fundamental problem with that, right? Because the agent's self-critique is only as good as the agent's ability to recognize its own errors.
Herman
There's an EMNLP 2025 paper that makes this point sharply. LLMs generate plausible but incorrect content with high internal self-consistency — a model can confidently defend a wrong answer through multiple reflection rounds. The self-consistency that makes the model's reasoning seem coherent is the same property that prevents it from catching errors it's confident about. You can run Reflexion ten times and get ten iterations of the same confident mistake.
Corn
So self-critique is not a substitute for human review. It catches a certain class of errors and is blind to another class.
Herman
That's the upshot. Which is why the question of whether plans get made visible matters so much. By the way, today's episode is running on Claude Sonnet 4.6, so if the script seems unusually well-organized, that's why.
Corn
Credit where it's due. Okay, so let's talk about the practical side — how do the actual frameworks handle plan persistence? Because LangGraph is the one that seems most architecturally serious about this.
Herman
LangGraph is doing something genuinely different. The model is that agent workflows are state machines — graphs where nodes are processing steps and edges define control flow. State is a typed data structure, either a TypedDict or a Pydantic model, that carries all the information the agent needs. And every step of the graph reads from and writes to a checkpoint of that state.
Corn
So the plan isn't just a thought — it's a typed field in a data structure that gets checkpointed.
Herman
You can literally define a state class with a plan field, an approved field, and an execution results field. The plan gets written by the planner node, the approved flag gets set by a human review step, and the executor node only runs when approved is true. One Medium article described LangGraph as feeling like Git for AI agents — checkpoints are like commits, you can fork from any checkpoint, resume from any point.
Corn
That metaphor actually extends pretty far. If checkpoints are commits, then an interrupt before the executor node is like a branch protection rule.
Herman
That's the interrupt mechanism, announced December 2024. It's modeled on Python's input function — the graph just pauses and waits. You can set interrupt_after equals planner to pause execution right after the planner node runs, before the executor touches anything. At that point, you call get_state on the graph, you see the plan, you read it, you decide. If you approve, you call invoke with None and execution resumes. If you want to edit the plan, you call update_state with a revised plan and then resume. The thread ID system means you can pause for hours or days — the state is persisted to a database backend.
Corn
And there's a real production example of this.
Herman
A travel booking agent built with LangGraph and Streamlit, from a MarkTechPost writeup in February 2026. The agent generates a TravelPlan as a Pydantic model — fields for trip title, origin, destination, dates, travelers, budget, preferences, lodging nights, all of it. The graph pauses at a wait-for-approval node using interrupt. The human sees the plan as editable JSON in the Streamlit UI. They can edit the JSON, approve or reject. Only after explicit approval does the execute-tools node run. The quote from the article is: "By combining LangGraph interrupts with a Streamlit frontend, we create a workflow that makes agent reasoning visible, controllable, and trustworthy instead of opaque and autonomous."
Corn
The gap being that LangGraph gives you all the plumbing but no built-in review UI. You're building the Streamlit app yourself.
Herman
That's the gap. LangGraph is infrastructure, not a product. The checkpointing system, the interrupt mechanism, the update-state capability — all there. The human-facing review interface is your problem.
Corn
Now, AutoGen takes a completely different approach.
Herman
AutoGen is a multi-agent conversation framework from Microsoft. Plans aren't explicit data structures — they emerge from dialogue between agents. The AssistantAgent writes code and plans in natural language. The UserProxyAgent can execute code and solicit human input. Human input modes are ALWAYS, NEVER, or TERMINATE, which controls when the proxy asks a human for input. AutoGen 0.4 supports save-state and load-state for persistence, but what you're saving is the entire conversation history, not a structured plan artifact.
Corn
So to review the plan in AutoGen, you'd have to read the conversation and figure out what the agent said it was going to do.
Herman
Which is not a terrible workflow for a developer who's already monitoring the conversation. But it's not a structured approval gate. The plan is implicit in the dialogue, not a first-class object.
Corn
Let's talk about Claude Code's plan mode, because I think this is the most interesting human-in-the-loop design from a product standpoint.
Herman
Plan mode is elegant in its constraints. When you're in plan mode, Claude can read files, glob for patterns, grep file contents, search the web — but it cannot write files, edit files, run shell commands, or execute anything. It's a read-only operating mode. The agent analyzes your codebase, asks clarifying questions, and generates a detailed implementation plan. Then it stops.
Corn
And the plan gets written to a file.
Herman
A markdown file in a plans directory, with auto-generated names like jaunty-petting-nebula.md. You can open it with Ctrl-G, edit it in your text editor, save it, and Claude picks up the changes. You can configure it to write plans to a project-local directory like docs/plans and commit them to version control.
Corn
So the plan becomes a version-controlled artifact. It's a decision record.
Herman
That's exactly how Anthropic's official best practices frame it. The recommendation from Boris Cherny, who created Claude Code, is a four-phase workflow: explore in plan mode, plan in plan mode, implement in normal mode, commit in normal mode. And the plan file committed to git becomes the PR description for significant features.
Corn
There's a practical efficiency argument here that's separate from the safety argument. A developer tracked their sessions and found that when they skip planning, they redo the task from scratch roughly forty percent of the time. With plan mode, a task that took thirty-five minutes of trial and error took twelve minutes.
Herman
That forty percent redo rate is the number that should make everyone stop and think. Planning isn't just about oversight — it's about not wasting time. The agent that plans first is faster, not slower. And there's a one-sentence rule from Anthropic's best practices: if you can describe the exact diff in one sentence, skip the plan. If you can't, plan first.
Corn
That's a genuinely useful heuristic. Now, Armin Ronacher — the Flask creator — reverse-engineered plan mode and found something interesting about how it's actually implemented.
Herman
This is where it gets philosophically interesting. Ronacher found that plan mode is, at its core, a prompt injection plus a state machine for entering and exiting the mode. The tools for writing files are still present in plan mode — they're in the tool list. It's prompt reinforcement that restricts their use, not structural removal. The agent is told not to write files, and it mostly complies. There's an ExitPlanMode tool that signals readiness for user review, but the agent can enter and exit plan mode itself.
Corn
So the enforcement is behavioral, not architectural. The agent is choosing not to write files because it's been instructed not to.
Herman
Which raises the question Ronacher poses directly: should planning constraints be enforced structurally, where the planning agent literally doesn't have write tools, or via prompting? OpenHands takes the structural approach.
Corn
Walk me through OpenHands.
Herman
OpenHands, formerly OpenDevin, uses an event-stream architecture with a clean two-phase planning workflow. Phase one: a planning agent with read-only tools — glob, grep, and a PlanningFileEditorTool that can only write to PLAN.md. Nothing else. Phase two: an execution agent with full editing capabilities reads PLAN.md and implements it. The planning agent physically cannot modify any file except PLAN.md. That's structural enforcement.
Corn
And PLAN.md has a defined structure enforced by the system prompt. Objective, context summary, approach overview, implementation steps, testing and validation.
Herman
Five sections, every time. The plan is a typed artifact with a schema. Compare that to Claude Code's plan, which is unstructured markdown. Both are persistent files on disk, both can be committed to version control, but OpenHands's plan has enforced sections that make it easier to review systematically.
Corn
The gap in OpenHands is that there's no built-in human approval gate between the planning phase and the execution phase. You'd need to wire that in yourself.
Herman
You'd add a step between the two phases where a human reads PLAN.md and either approves or edits before the execution agent starts. The infrastructure for the artifact is there; the approval workflow is not.
Corn
Let's talk about Devin, because Devin's model is interestingly different from all of these.
Herman
Devin is Cognition Labs' autonomous coding agent, now at twenty dollars a month after dropping from five hundred dollars with Devin 2.0 in April 2025. The planning mechanism is a to-do list — a simple scrollable sidebar in the Devin UI showing current task status. It's visible while the agent is executing, not before.
Corn
So the review doesn't happen at the plan stage. It happens at the pull request stage.
Herman
The PR is the review artifact. Devin works autonomously, generates code, and opens a draft PR for your review. You're reviewing the output, not the plan. The to-do list is a progress tracker, not a pre-execution approval gate.
Corn
And the PR merge rate went from thirty-four percent to sixty-seven percent in one year, according to Cognition's 2025 annual review.
Herman
Sixty-seven percent of PRs now merged, nearly double the prior year. The efficiency numbers are striking — security vulnerabilities that take human developers an average of thirty minutes take Devin an average of ninety seconds. Language migrations running ten to fourteen times faster than human engineers. Goldman Sachs, Santander, Nubank are customers.
Corn
But thirty-three percent of PRs still get rejected. And all that work happened before the rejection. If Devin had a plan review step, you could potentially catch some of those rejections before the agent spent time on a doomed implementation.
Herman
That's the counterfactual question. What if Devin showed you its to-do list and asked for approval before starting? The to-do list exists — it's right there in the UI. Adding an approval gate before execution starts is not technically hard. Whether Cognition thinks it's the right product decision is a different question.
Corn
GitHub Copilot Coding Agent is the other major production system here, and it's similar to Devin in that the review happens at the PR stage.
Herman
Assign a GitHub issue to Copilot, it works in the background, opens a draft PR. The GitHub community best practices from December 2025 are interesting — AI opens draft PRs only, no direct pushes to protected branches, every PR must have a clear problem statement and agent-generated rationale for the approach chosen. Tag AI-generated PRs and require extra approvals or higher coverage thresholds.
Corn
There's ACM research from March 2026 on this — analysis of twenty-nine thousand-plus PRs across five AI coding tools. What did it find?
Herman
The framing from that paper is that agents mainly initiate work while humans keep merge authority. And there's a bimodal outcome pattern — AI-generated PRs either get merged quickly or fail iteratively, contrasting with the more gradual review distribution for human PRs. The human is still the final gate, but the review is happening at the end of the process, not the beginning.
Corn
So let's pull back and ask the big question Daniel is pointing at. The vision of treating an agent's plan like a pull request — review it, comment on it, approve or reject it, then execute. What would that actually require?
Herman
Six things, roughly. A structured plan format — not just markdown but typed fields, step dependencies, maybe a schema. Inline commenting on specific plan steps. Version history of plan revisions. Approval gates with named approvers. Automatic execution upon approval. And rollback if execution diverges from plan.
Corn
How close does anything get to all six?
Herman
LangGraph gets closest on the infrastructure side. Typed state, checkpointing, interrupt for approval gates, update-state for editing, thread IDs for async review, resume for execution. But you're building the review UI yourself, there's no inline commenting system, and there's no built-in version history of plan revisions beyond what you implement in your state schema.
Corn
Claude Code plan mode gives you the persistent artifact and version control integration, but the plan is unstructured and there's no approval tracking.
Herman
OpenHands gives you the structured artifact and structural enforcement of the planning constraint, but no built-in approval gate. Devin and Copilot give you the review artifact, but it's the output, not the plan. Nobody has all six.
Corn
The thing that strikes me is that the tooling exists to build this. LangGraph's interrupt plus a decent web frontend plus a structured plan schema — you could build the plan-as-pull-request workflow today. It's not a research problem, it's an engineering problem.
Herman
And there are teams doing it. The LangGraph FastAPI pattern is essentially that — a thread ID system where you POST to start a workflow, get back an interrupted state with the plan, do your review, and POST to resume. The gap is that nobody has productized this into a clean developer experience.
Corn
The CLAUDE.md point is worth making here too. Plan mode quality is directly proportional to how good your CLAUDE.md is. Without it, Claude might plan to add Swagger when you use Scalar, or use DateTime.Now when your standards require DateTimeOffset.UtcNow.
Herman
The plan is only as good as the context the agent has about your codebase and conventions. CLAUDE.md is how you front-load that context. With a well-crafted CLAUDE.md, Claude's plan already follows your conventions from the first draft, which means fewer revision cycles before approval.
Corn
What's the practical takeaway for someone building agents today? If you're using LangGraph, where do you start?
Herman
Start with interrupt_after on your planner node. That's one line of code that adds a human review gate between planning and execution. Add a plan field to your state TypedDict. Use a PostgreSQL checkpointer instead of InMemorySaver for production. That's the minimum viable plan review system. From there you can layer in update-state for editing and build a simple web UI for the review step.
Corn
For Claude Code users, the practical habit is Plan Mode by default for anything you can't describe in one sentence. Commit your plan files to git. Use a docs/plans directory so plans become part of your project history.
Herman
And the CLAUDE.md investment pays off here specifically. Every hour you spend writing good CLAUDE.md content saves review cycles on every plan the agent generates going forward.
Corn
For teams evaluating Devin or Copilot Coding Agent, the question to ask is: at what point in the workflow do you want human judgment? If you're comfortable reviewing at the PR stage, those tools work well. If you want pre-execution approval, you're going to need to wire in additional gates.
Herman
The Devin PR merge rate going from thirty-four to sixty-seven percent in a year is impressive, but it also tells you that a third of the work is still being thrown away. Pre-execution plan review is a lever you could pull to reduce that waste.
Corn
The self-consistency trap is the other thing I'd want practitioners to internalize. Reflexion and self-critique are useful, but they don't catch errors the agent is confident about. Human review catches a fundamentally different class of errors.
Herman
The EMNLP 2025 research on this is pretty clear. The same internal consistency that makes an agent's reasoning seem coherent is what prevents it from seeing its own confident mistakes. You can run ten reflection rounds and get ten iterations of the same wrong answer. Human review is not optional as a quality gate — it's doing something the agent structurally cannot do for itself.
Corn
There's a future here that seems pretty clear — the plan-as-pull-request workflow becomes standard infrastructure for any serious agentic deployment. Structured plan artifacts, human approval gates, version history, inline commenting. The building blocks are all there.
Herman
The thing I'd add is that this isn't just about safety or oversight in the abstract. The forty percent redo rate finding is the economic argument. Planning first, with human review, is faster than executing first and fixing failures. The agent that generates a reviewable plan and waits for approval ships more reliably than the agent that charges ahead. That's the case you make to an engineering organization.
Corn
Good stuff. Thanks as always to our producer Hilbert Flumingtop for keeping this show running. Big thanks to Modal for providing the GPU credits that power the generation pipeline — genuinely couldn't do this without them. This has been My Weird Prompts. If you want to get notified when new episodes drop, search for My Weird Prompts on Telegram. See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.