Here's what Daniel sent us this week. He wants to dig into agent planning strategies — specifically, how AI agents decide what to do before they do it, and whether humans can actually see and review those plans. He's asking about the major planning patterns: ReAct, plan-and-execute, tree-of-thought, reflexion. But the deeper question is practical: how do agents represent plans as state, do those plans get saved somewhere a human can actually touch, and can we get to a world where you treat an agent's plan like a pull request — review it, comment on it, approve or reject it, then let the agent execute? A lot to unpack here.
Herman Poppleberry, by the way. And yeah, this one gets at something I think is genuinely underappreciated — the gap between an agent that has a plan and an agent whose plan you can actually see and edit before it starts breaking things.
Because those are very different things.
Completely different. And the framing I keep coming back to is this: most agents today have plans the way you have a plan when you're half-asleep. There's something going on in there, but you couldn't write it down if someone asked you to.
That's a disturbing image. Let's start with the patterns themselves, because I think the academic framing matters here for understanding why the practical implementations look the way they do. ReAct is the baseline, right?
ReAct is the workhorse. Yao et al. out of Princeton and Google Brain, presented at ICLR 2023. The core loop is alternating thought, action, observation — the agent thinks about what to do, takes an action, observes the result, and repeats. The prompt template is almost comically explicit: "Thought: you should always think about what to do. Action: the action to take. Action Input: the input to the action. Observation: the result of the action." And you repeat that until you get a final answer.
So the plan is just... the scratchpad. The running trace of thoughts and observations.
That's the key insight. The plan in ReAct is not a discrete artifact. It lives in the context window as an accumulating sequence of reasoning steps. It reduces hallucinations compared to chain-of-thought alone because the agent is grounding its reasoning in real tool outputs — but the plan itself is ephemeral. By the time you see what the agent is doing, it's already doing it.
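The loop is small enough to sketch in full. Here's a minimal, dependency-free version, with stub callables standing in for the LLM and the tools — `react_loop` and its signature are illustrative, not any framework's API. Notice that the "plan" is nothing but the growing scratchpad string:

```python
def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct: the plan is just this accumulating scratchpad,
    a transcript of Thought/Action/Observation lines in the context."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        # the (stub) LLM reads the whole trace and decides the next step
        thought, action, action_input = llm(scratchpad)
        scratchpad += (
            f"Thought: {thought}\nAction: {action}\n"
            f"Action Input: {action_input}\n"
        )
        if action == "Final Answer":
            return action_input, scratchpad
        # ground the reasoning in a real tool result
        observation = tools[action](action_input)
        scratchpad += f"Observation: {observation}\n"
    return None, scratchpad
```

There is no moment in that loop where a reviewable artifact exists before execution: by the time a tool call appears in the trace, it has already run.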
Which is fine if you trust the agent completely and the task is low stakes. Less fine if you're deploying this against a production database.
Right. And that's why plan-and-execute exists. The idea is to separate the planning phase from the execution phase. A planner LLM generates a numbered list of steps before any action is taken — step one: find this, step two: do that, step three: compile. Then an executor agent works through each step. The key advantage is that the plan exists as a discrete artifact before execution begins. On complex tasks with GPT-4, plan-and-execute hits around ninety-two percent task completion accuracy versus eighty-five percent for ReAct. The tradeoff is cost — you're spending more tokens and making more API calls.
So the plan-and-execute approach is better at the task, but costs more. And the plan actually exists as text you could theoretically intercept.
Theoretically. Most implementations don't build that interception gate in. The plan exists, but nobody's stopping to ask a human whether it looks right.
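In sketch form, the separation looks like this — stub `planner` and `executor` callables, hypothetical names. The point is structural: the plan exists as a list before anything runs, so the (usually absent) review hook has an obvious place to live:

```python
def plan_and_execute(planner, executor, task, review=None):
    """Plan-and-execute: planning and execution are separate phases,
    so the plan is a discrete artifact before any action is taken."""
    plan = planner(task)        # e.g. ["1. find X", "2. do Y", "3. compile"]
    if review is not None:
        plan = review(plan)     # the interception gate most implementations skip
    results = []
    for step in plan:           # executor works through the (approved) steps
        results.append(executor(step, results))
    return plan, results
```

Passing a `review` callable that shows the list to a human and returns an edited version is all it takes to turn this into a gated workflow.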
What about ReWOO? Because that one's interesting — it generates the entire tool call sequence upfront, right?
ReWOO is clever. Instead of the alternating thought-action loop, the LLM generates the entire sequence of tool calls in one pass, using placeholder variables for future results. So you'd get: use Google to find the Australian Open winner, assign that to E1, use an LLM call to extract the name from E1, assign that to E2, find E2's hometown, and so on. Then a worker module executes each step, and a solver synthesizes the final answer.
The plan is almost like a program. It has variables and dependencies.
The efficiency gain is real — you avoid the repeated prompt overhead of ReAct, where you're sending the entire dialogue history to the LLM at every step. With ReWOO, the LLM is called once for planning, once per tool execution, once for synthesis. But like ReAct, the plan is still typically generated and executed immediately. It's not persisted, it's not made reviewable. It's a program that runs once and disappears.
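The placeholder-variable mechanics are worth seeing concretely. This is a toy worker, not ReWOO's actual implementation — the tools are stubs and the plan tuples are a hypothetical encoding — but it shows how a fully upfront plan can reference results that don't exist yet:

```python
def execute_rewoo(plan, tools):
    """ReWOO-style execution: the whole tool sequence was planned in one
    pass, with #E-variables standing in for results not yet known. The
    worker substitutes evidence into each argument as it accumulates."""
    evidence = {}
    for var, tool, arg in plan:
        for prev, value in evidence.items():
            arg = arg.replace(f"#{prev}", value)  # resolve placeholders
        evidence[var] = tools[tool](arg)
    return evidence

# The plan is generated before anything runs -- almost a small program,
# with variables and dependencies.
plan = [
    ("E1", "search", "2019 Australian Open men's singles winner"),
    ("E2", "extract", "player name in: #E1"),
    ("E3", "search", "hometown of #E2"),
]
```

Because the whole sequence is written down before execution, this is also — in principle — an interceptable artifact; ReWOO just doesn't stop to show it to anyone.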
And then there's tree-of-thought, which is the expensive one.
Tree-of-thought is what you reach for when the problem has a search space — puzzles, optimization, creative tasks where a single reasoning chain might walk straight into a dead end. Instead of one chain of reasoning, the agent explores multiple candidate next steps at each point, scores each candidate, prunes the weak ones, and builds the tree deeper from the survivors. The LangGraph tutorial uses the twenty-four game as the canonical example — make the number twenty-four from four given numbers using basic arithmetic. A single chain of thought will often fail. A tree search finds the solution.
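The tree-search structure is easy to see on the twenty-four game itself. In this sketch the LLM's candidate proposal and value scoring are replaced by exhaustive expansion — a real tree-of-thought agent would generate and prune candidates with model calls — but the shape is the same: branch at each step, keep exploring, back out of dead ends a single chain would get stuck in:

```python
from itertools import permutations

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b else None,  # guard division by zero
}

def expand(state):
    """Candidate next steps: combine any two numbers with any operator."""
    nums, trace = state
    for i, j in permutations(range(len(nums)), 2):
        for sym, fn in OPS.items():
            val = fn(nums[i], nums[j])
            if val is None:
                continue
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            yield rest + [val], trace + [f"{nums[i]} {sym} {nums[j]} = {val}"]

def solve_24(numbers):
    """Depth-first search over reasoning paths for the 24 game."""
    frontier = [(list(map(float, numbers)), [])]
    while frontier:
        nums, trace = frontier.pop()
        if len(nums) == 1 and abs(nums[0] - 24) < 1e-6:
            return trace  # a successful reasoning path through the tree
        frontier.extend(expand((nums, trace)))
    return None  # every branch was a dead end
```

The returned trace is one path through the tree — which hints at the review problem: the "plan" here is the whole tree, not any single path.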
The plan in tree-of-thought is a tree of reasoning paths. Which is not exactly something you can hand to a human and say "does this look right to you?"
Not in any intuitive way. And it's expensive — multiple LLM calls per step. There's a 2024 extension called LATS, presented at ICML, that combines tree-of-thought with Monte Carlo Tree Search and something called Reflexion, which brings us to the fifth pattern.
Reflexion is the self-critique one.
Reflexion from Shinn et al. at NeurIPS 2023. Three components: an actor that generates actions, an evaluator that scores the output, and a self-reflection module that generates natural language analysis of what went wrong. On subsequent attempts, the actor gets its previous reflections as context. The GitHub repo has about three thousand stars. Performance improvements on coding benchmarks are in the range of ten to twenty percentage points over baseline.
So the agent is basically writing a postmortem after each failure and using that as context for the next try.
Which is actually the first pattern where the plan gets revised based on failure. The verbal memory buffer carries forward what didn't work. But here's the thing — and this is important — Reflexion is still entirely internal. The reflections are not exposed for human review. The agent is running its own retrospectives, and you're not invited.
And there's a fundamental problem with that, right? Because the agent's self-critique is only as good as the agent's ability to recognize its own errors.
There's an EMNLP 2025 paper that makes this point sharply. LLMs generate plausible but incorrect content with high internal self-consistency — a model can confidently defend a wrong answer through multiple reflection rounds. The self-consistency that makes the model's reasoning seem coherent is the same property that prevents it from catching errors it's confident about. You can run Reflexion ten times and get ten iterations of the same confident mistake.
So self-critique is not a substitute for human review. It catches a certain class of errors and is blind to another class.
That's the upshot. Which is why the question of whether plans get made visible matters so much. By the way, today's episode is running on Claude Sonnet 4.6, so if the script seems unusually well-organized, that's why.
Credit where it's due. Okay, so let's talk about the practical side — how do the actual frameworks handle plan persistence? Because LangGraph is the one that seems most architecturally serious about this.
LangGraph is doing something genuinely different. The model is that agent workflows are state machines — graphs where nodes are processing steps and edges define control flow. State is a typed data structure, either a TypedDict or a Pydantic model, that carries all the information the agent needs. And every step of the graph reads from and writes to a checkpoint of that state.
So the plan isn't just a thought — it's a typed field in a data structure that gets checkpointed.
You can literally define a state class with a plan field, an approved field, and an execution results field. The plan gets written by the planner node, the approved flag gets set by a human review step, and the executor node only runs when approved is true. One Medium article described LangGraph as feeling like Git for AI agents — checkpoints are like commits, you can fork from any checkpoint, resume from any point.
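Concretely, the state class described here might look like the following — field names are illustrative, though LangGraph does accept any TypedDict or Pydantic model as graph state:

```python
from typing import List, TypedDict

class AgentState(TypedDict):
    task: str
    plan: List[str]               # written by the planner node
    approved: bool                # flipped by the human review step
    execution_results: List[str]  # written by the executor, only once approved
```

Because every node reads and writes this one typed structure, the plan stops being a thought in a context window and becomes an inspectable field you can checkpoint, diff, and edit.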
That metaphor actually extends pretty far. If checkpoints are commits, then an interrupt before the executor node is like a branch protection rule.
That's the interrupt mechanism, announced December 2024. It's modeled on Python's input function — the graph just pauses and waits. You can set interrupt_after equals planner to pause execution right after the planner node runs, before the executor touches anything. At that point, you call get_state on the graph, you see the plan, you read it, you decide. If you approve, you call invoke with None and execution resumes. If you want to edit the plan, you call update_state with a revised plan and then resume. The thread ID system means you can pause for hours or days — the state is persisted to a database backend.
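That lifecycle — pause after planning, inspect, optionally edit, resume — can be sketched without the framework at all. This is a toy in-memory stand-in, not LangGraph's API; the method names mirror the concepts discussed above rather than the library's actual signatures:

```python
class PausableRun:
    """Toy model of an interrupt-style gate: the run checkpoints its
    state after planning and waits until a reviewer resumes it."""

    def __init__(self, planner, executor, task):
        self.executor = executor
        self.state = {"task": task, "plan": planner(task), "results": None}
        self.paused = True  # analogous to interrupting right after the planner node

    def get_state(self):
        return dict(self.state)   # reviewer reads the checkpointed plan

    def update_state(self, **edits):
        self.state.update(edits)  # reviewer edits the plan before resuming

    def resume(self):
        self.paused = False       # approval: execution proceeds from the checkpoint
        self.state["results"] = [self.executor(s) for s in self.state["plan"]]
        return self.state
```

The real thing improves on this in exactly the way described above: the state lives in a database backend keyed by a thread ID, so the pause survives process restarts and can last for days; here it only lives in memory.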
And there's a real production example of this.
A travel booking agent built with LangGraph and Streamlit, from a MarkTechPost writeup in February 2026. The agent generates a TravelPlan as a Pydantic model — fields for trip title, origin, destination, dates, travelers, budget, preferences, lodging nights, all of it. The graph pauses at a wait-for-approval node using interrupt. The human sees the plan as editable JSON in the Streamlit UI. They can edit the JSON, approve or reject. Only after explicit approval does the execute-tools node run. The quote from the article is: "By combining LangGraph interrupts with a Streamlit frontend, we create a workflow that makes agent reasoning visible, controllable, and trustworthy instead of opaque and autonomous."
The gap being that LangGraph gives you all the plumbing but no built-in review UI. You're building the Streamlit app yourself.
That's the gap. LangGraph is infrastructure, not a product. The checkpointing system, the interrupt mechanism, the update-state capability — all there. The human-facing review interface is your problem.
Now, AutoGen takes a completely different approach.
AutoGen is a multi-agent conversation framework from Microsoft. Plans aren't explicit data structures — they emerge from dialogue between agents. The AssistantAgent writes code and plans in natural language. The UserProxyAgent can execute code and solicit human input. Human input modes are ALWAYS, NEVER, or TERMINATE, which controls when the proxy asks a human for input. AutoGen 0.4 supports save-state and load-state for persistence, but what you're saving is the entire conversation history, not a structured plan artifact.
So to review the plan in AutoGen, you'd have to read the conversation and figure out what the agent said it was going to do.
Which is not a terrible workflow for a developer who's already monitoring the conversation. But it's not a structured approval gate. The plan is implicit in the dialogue, not a first-class object.
Let's talk about Claude Code's plan mode, because I think this is the most interesting human-in-the-loop design from a product standpoint.
Plan mode is elegant in its constraints. When you're in plan mode, Claude can read files, glob for patterns, grep file contents, search the web — but it cannot write files, edit files, run shell commands, or execute anything. It's a read-only operating mode. The agent analyzes your codebase, asks clarifying questions, and generates a detailed implementation plan. Then it stops.
And the plan gets written to a file.
A markdown file in a plans directory, with auto-generated names like jaunty-petting-nebula.md. You can open it with Ctrl-G, edit it in your text editor, save it, and Claude picks up the changes. You can configure it to write plans to a project-local directory like docs/plans and commit them to version control.
So the plan becomes a version-controlled artifact. It's a decision record.
That's exactly how Anthropic's official best practices frame it. The recommendation from Boris Cherny, who created Claude Code, is a four-phase workflow: explore in plan mode, plan in plan mode, implement in normal mode, commit in normal mode. And the plan file committed to git becomes the PR description for significant features.
There's a practical efficiency argument here that's separate from the safety argument. A developer tracked their sessions and found that when they skip planning, they redo the task from scratch roughly forty percent of the time. With plan mode, a task that took thirty-five minutes of trial and error took twelve minutes.
That forty percent redo rate is the number that should make everyone stop and think. Planning isn't just about oversight — it's about not wasting time. The agent that plans first is faster, not slower. And there's a one-sentence rule from Anthropic's best practices: if you can describe the exact diff in one sentence, skip the plan. If you can't, plan first.
That's a genuinely useful heuristic. Now, Armin Ronacher — the Flask creator — reverse-engineered plan mode and found something interesting about how it's actually implemented.
This is where it gets philosophically interesting. Ronacher found that plan mode is, at its core, a prompt injection plus a state machine for entering and exiting the mode. The tools for writing files are still present in plan mode — they're in the tool list. It's prompt reinforcement that restricts their use, not structural removal. The agent is told not to write files, and it mostly complies. There's an ExitPlanMode tool that signals readiness for user review, but the agent can enter and exit plan mode itself.
So the enforcement is behavioral, not architectural. The agent is choosing not to write files because it's been instructed not to.
Which raises the question Ronacher poses directly: should planning constraints be enforced structurally, where the planning agent literally doesn't have write tools, or via prompting? OpenHands takes the structural approach.
Walk me through OpenHands.
OpenHands, formerly OpenDevin, uses an event-stream architecture with a clean two-phase planning workflow. Phase one: a planning agent with read-only tools — glob, grep, and a PlanningFileEditorTool that can only write to PLAN.md. Nothing else. Phase two: an execution agent with full editing capabilities reads PLAN.md and implements it. The planning agent physically cannot modify any file except PLAN.md. That's structural enforcement.
And PLAN.md has a defined structure enforced by the system prompt. Objective, context summary, approach overview, implementation steps, testing and validation.
Five sections, every time. The plan is a typed artifact with a schema. Compare that to Claude Code's plan, which is unstructured markdown. Both are persistent files on disk, both can be committed to version control, but OpenHands's plan has enforced sections that make it easier to review systematically.
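A fixed schema like that means a plan can be checked mechanically before a human even reads it. A small illustrative validator — the level-two markdown heading format here is an assumption for the sketch, not OpenHands's actual file layout:

```python
REQUIRED_SECTIONS = [
    "Objective",
    "Context Summary",
    "Approach Overview",
    "Implementation Steps",
    "Testing and Validation",
]

def missing_sections(plan_md: str):
    """Return the required sections absent from a PLAN.md draft,
    assuming each section is a '## ' markdown heading."""
    return [s for s in REQUIRED_SECTIONS if f"## {s}" not in plan_md]
```

An approval gate could refuse to hand a plan to a reviewer, let alone an executor, until this returns an empty list.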
The gap in OpenHands is that there's no built-in human approval gate between the planning phase and the execution phase. You'd need to wire that in yourself.
You'd add a step between the two phases where a human reads PLAN.md and either approves or edits before the execution agent starts. The infrastructure for the artifact is there; the approval workflow is not.
Let's talk about Devin, because Devin's model is interestingly different from all of these.
Devin is Cognition Labs' autonomous coding agent, now at twenty dollars a month after dropping from five hundred dollars with Devin 2.0 in April 2025. The planning mechanism is a to-do list — a simple scrollable sidebar in the Devin UI showing current task status. It's visible while the agent is executing, not before.
So the review doesn't happen at the plan stage. It happens at the pull request stage.
The PR is the review artifact. Devin works autonomously, generates code, and opens a draft PR for your review. You're reviewing the output, not the plan. The to-do list is a progress tracker, not a pre-execution approval gate.
And the PR merge rate went from thirty-four percent to sixty-seven percent in one year, according to Cognition's 2025 annual review.
Sixty-seven percent of PRs now merged, nearly double the prior year. The efficiency numbers are striking — security vulnerabilities that take human developers an average of thirty minutes take Devin an average of ninety seconds. Language migrations running ten to fourteen times faster than human engineers. Goldman Sachs, Santander, Nubank are customers.
But thirty-three percent of PRs still get rejected. And all that work happened before the rejection. If Devin had a plan review step, you could potentially catch some of those rejections before the agent spent time on a doomed implementation.
That's the counterfactual question. What if Devin showed you its to-do list and asked for approval before starting? The to-do list exists — it's right there in the UI. Adding an approval gate before execution starts is not technically hard. Whether Cognition thinks it's the right product decision is a different question.
GitHub Copilot Coding Agent is the other major production system here, and it's similar to Devin in that the review happens at the PR stage.
Assign a GitHub issue to Copilot, it works in the background, opens a draft PR. The GitHub community best practices from December 2025 are interesting — AI opens draft PRs only, no direct pushes to protected branches, every PR must have a clear problem statement and agent-generated rationale for the approach chosen. Tag AI-generated PRs and require extra approvals or higher coverage thresholds.
There's ACM research from March 2026 on this — analysis of twenty-nine thousand-plus PRs across five AI coding tools. What did it find?
The framing from that paper is that agents mainly initiate work while humans keep merge authority. And there's a bimodal outcome pattern — AI-generated PRs either get merged quickly or fail iteratively, contrasting with the more gradual review distribution for human PRs. The human is still the final gate, but the review is happening at the end of the process, not the beginning.
So let's pull back and ask the big question Daniel is pointing at. The vision of treating an agent's plan like a pull request — review it, comment on it, approve or reject it, then execute. What would that actually require?
Six things, roughly. A structured plan format — not just markdown but typed fields, step dependencies, maybe a schema. Inline commenting on specific plan steps. Version history of plan revisions. Approval gates with named approvers. Automatic execution upon approval. And rollback if execution diverges from plan.
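The first several of those six could be prototyped as a plain data model. Everything below is hypothetical — a sketch of what a reviewable plan object might carry, not any existing tool's schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanStep:
    id: int
    description: str
    depends_on: List[int] = field(default_factory=list)  # step dependencies
    comments: List[str] = field(default_factory=list)    # inline review comments

@dataclass
class Plan:
    title: str
    steps: List[PlanStep]
    revision: int = 1                  # version history of plan revisions
    approved_by: Optional[str] = None  # named approver; None = not yet approved

    def revise(self, new_steps: List[PlanStep]) -> None:
        self.steps, self.revision = new_steps, self.revision + 1
        self.approved_by = None        # edits invalidate any prior approval

    def approve(self, reviewer: str) -> None:
        self.approved_by = reviewer    # the gate the executor checks before running
```

An executor that refuses to run while `approved_by` is `None` gives you the approval gate; serializing revisions gives you history; the remaining two items — automatic execution on approval and rollback on divergence — are orchestration on top.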
How close does anything get to all six?
LangGraph gets closest on the infrastructure side. Typed state, checkpointing, interrupt for approval gates, update-state for editing, thread IDs for async review, resume for execution. But you're building the review UI yourself, there's no inline commenting system, and there's no built-in version history of plan revisions beyond what you implement in your state schema.
Claude Code plan mode gives you the persistent artifact and version control integration, but the plan is unstructured and there's no approval tracking.
OpenHands gives you the structured artifact and structural enforcement of the planning constraint, but no built-in approval gate. Devin and Copilot give you the review artifact, but it's the output, not the plan. Nobody has all six.
The thing that strikes me is that the tooling exists to build this. LangGraph's interrupt plus a decent web frontend plus a structured plan schema — you could build the plan-as-pull-request workflow today. It's not a research problem, it's an engineering problem.
And there are teams doing it. The LangGraph FastAPI pattern is essentially that — a thread ID system where you POST to start a workflow, get back an interrupted state with the plan, do your review, and POST to resume. The gap is that nobody has productized this into a clean developer experience.
The CLAUDE.md point is worth making here too. Plan mode quality is directly proportional to how good your CLAUDE.md is. Without it, Claude might plan to add Swagger when you use Scalar, or use DateTime.Now when your standards require DateTimeOffset.UtcNow.
The plan is only as good as the context the agent has about your codebase and conventions. CLAUDE.md is how you front-load that context. With a well-crafted CLAUDE.md, Claude's plan already follows your conventions from the first draft, which means fewer revision cycles before approval.
What's the practical takeaway for someone building agents today? If you're using LangGraph, where do you start?
Start with interrupt_after on your planner node. That's one line of code that adds a human review gate between planning and execution. Add a plan field to your state TypedDict. Use a PostgreSQL checkpointer instead of InMemorySaver for production. That's the minimum viable plan review system. From there you can layer in update-state for editing and build a simple web UI for the review step.
For Claude Code users, the practical habit is Plan Mode by default for anything you can't describe in one sentence. Commit your plan files to git. Use a docs/plans directory so plans become part of your project history.
And the CLAUDE.md investment pays off here specifically. Every hour you spend writing good CLAUDE.md content saves review cycles on every plan the agent generates going forward.
For teams evaluating Devin or Copilot Coding Agent, the question to ask is: at what point in the workflow do you want human judgment? If you're comfortable reviewing at the PR stage, those tools work well. If you want pre-execution approval, you're going to need to wire in additional gates.
The Devin PR merge rate going from thirty-four to sixty-seven percent in a year is impressive, but it also tells you that a third of the work is still being thrown away. Pre-execution plan review is a lever you could pull to reduce that waste.
The self-consistency trap is the other thing I'd want practitioners to internalize. Reflexion and self-critique are useful, but they don't catch errors the agent is confident about. Human review catches a fundamentally different class of errors.
The EMNLP 2025 research on this is pretty clear. The same internal consistency that makes an agent's reasoning seem coherent is what prevents it from seeing its own confident mistakes. You can run ten reflection rounds and get ten iterations of the same wrong answer. Human review is not optional as a quality gate — it's doing something the agent structurally cannot do for itself.
There's a future here that seems pretty clear — the plan-as-pull-request workflow becomes standard infrastructure for any serious agentic deployment. Structured plan artifacts, human approval gates, version history, inline commenting. The building blocks are all there.
The thing I'd add is that this isn't just about safety or oversight in the abstract. The forty percent redo rate finding is the economic argument. Planning first, with human review, is faster than executing first and fixing failures. The agent that generates a reviewable plan and waits for approval ships more reliably than the agent that charges ahead. That's the case you make to an engineering organization.
Good stuff. Thanks as always to our producer Hilbert Flumingtop for keeping this show running. Big thanks to Modal for providing the GPU credits that power the generation pipeline — genuinely couldn't do this without them. This has been My Weird Prompts. If you want to get notified when new episodes drop, search for My Weird Prompts on Telegram. See you next time.