Daniel sent us this one — he wants us to walk through the langchain hyphen A I slash deep agents repository. This is the open source agent harness that's been getting a lot of attention, and I think the interesting thing here is that it's not just another framework. It's batteries-included, it's opinionated, and it ships with its own terminal-based coding agent. There's a lot to get into.
Also, fun fact — DeepSeek V four Pro is writing our script today. So if anything sounds unusually coherent, that's why.
I was going to say, you sound suspiciously organized this morning.
I'll take that as a compliment. But yes, let's get into this. The deep agents repo is a monorepo, and I want to start with what that actually means structurally, because it tells you a lot about the project's ambition. You've got the core S D K in libs slash deep agents, you've got a full terminal-based C L I in libs slash C L I, there's an Agent Context Protocol package, a whole evaluation suite, partner integrations for sandbox providers like Daytona and Modal — and then a separate R E P L package, plus a Quick J S integration.
This isn't a weekend project. Seven hundred fifty-four files, about seven point eight million tokens when you count it all up. The C L I alone has a full Textual-based terminal interface, a web front end, deploy commands, all of it.
The thing to understand is that the core value proposition is "batteries included." The readme says it right up front — instead of wiring up prompts, tools, and context management yourself, you get a working agent immediately. One import, one function call, and you're running.
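As a rough sketch, that one-import, one-call flow looks something like the following; the keyword names have shifted between S D K versions, so treat system_prompt and friends as assumptions rather than the confirmed signature.

```python
# Minimal sketch of the "batteries included" entry point. Assumes the
# deepagents package is installed and an LLM API key is set in the
# environment; the system_prompt keyword is an assumption about the
# current SDK version.
from deepagents import create_deep_agent

agent = create_deep_agent(
    tools=[],  # planning, filesystem, and sub-agent tools are built in
    system_prompt="You are a careful assistant for exploring codebases.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Outline this repository."}]}
)
print(result["messages"][-1].content)
```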
I want to pause on that phrase "batteries included," because every framework in this space claims it, and most of them mean "we included some example code you can copy." What deep agents actually ships is planning, filesystem operations, shell access, sub-agents with isolated context windows, automatic context summarization, and prompts that teach the model how to use all of that effectively. That's not a starter kit. That's a finished product.
It's built on LangGraph, which is their production runtime with streaming, persistence, and checkpointing. So you get all of that infrastructure for free. The agent graph is compiled, you can use it with LangGraph Studio, you can add checkpointers — it's not a toy.
Let's talk about what's actually in the box. You call create underscore deep underscore agent, and what do you get?
The planning layer is the first thing I want to highlight. The agent gets a write underscore todos tool that lets it break down tasks and track progress. This matters more than people realize, because one of the biggest failure modes for autonomous agents is losing track of what they're doing halfway through a multi-step task. The todos tool gives the model external memory for its own plan.
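To make that concrete, a write underscore todos call from the model might carry a payload shaped roughly like this; the field names and status values are illustrative rather than the tool's exact schema.

```python
# Hypothetical shape of a write_todos tool call emitted mid-task;
# field names and status values are illustrative, not the exact schema.
todo_update = {
    "todos": [
        {"content": "Read pyproject.toml to map the packages", "status": "completed"},
        {"content": "Trace the CLI command tree", "status": "in_progress"},
        {"content": "Summarize the middleware stack", "status": "pending"},
    ]
}
```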
It's visible to the user too, right? You can see what the agent thinks it's doing.
The C L I renders it in a dedicated panel. Then you've got the filesystem tools — read file, write file, edit file, L S, glob, grep. These are the primitives that let the agent use files as external memory. Large outputs get saved to files instead of being crammed into context. The agent can search its own work.
The shell access is where things get interesting from a security perspective. The readme has this line that I appreciate: "Deep Agents follows a trust the L L M model. The agent can do anything its tools allow. Enforce boundaries at the tool slash sandbox level, not by expecting the model to self-police."
That's refreshingly honest. Most projects in this space hand-wave about prompt-based safety and then act surprised when the model does something unexpected. Deep agents says: sandbox it properly, don't ask the model nicely. The C L I has a shell allow list system, and the S D K supports remote sandboxes through partners like Daytona, Modal, and Runloop.
You can run the agent in a container, and if it tries something outside the allow list, it just fails. The guardrail is structural, not rhetorical.
Then there's the sub-agents system. This is where the architecture gets genuinely clever. The task tool lets the main agent delegate work to sub-agents that have isolated context windows. So you can spin up a sub-agent to research one topic, another to write code, another to review — and they don't pollute each other's context.
Which solves the context window problem in a way that's more elegant than just summarization. You don't need to compress everything if you can farm out work to agents that each only see what they need.
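A hedged sketch of what that delegation setup might look like in code; the subagents parameter and its dict fields are assumptions about the S D K's configuration shape, not a confirmed signature.

```python
# Sketch of delegating to sub-agents with isolated context windows.
# The subagents parameter and its dict fields are assumptions about the
# SDK's configuration shape.
from deepagents import create_deep_agent

researcher = {
    "name": "researcher",
    "description": "Investigates one topic and returns a short brief.",
    "system_prompt": "Research the topic you are given; report conclusions only.",
}

reviewer = {
    "name": "reviewer",
    "description": "Reviews drafts for accuracy and tone.",
    "system_prompt": "Critique the draft you receive; do not rewrite it.",
}

agent = create_deep_agent(
    system_prompt="Plan the work with todos, then delegate via the task tool.",
    subagents=[researcher, reviewer],
)
```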
The async sub-agent pattern takes this further. There's a whole example in the repo — async subagent server — that shows how to host a research agent as a Fast A P I server implementing the Agent Protocol, and then connect to it from a supervisor agent. The supervisor delegates a research task, gets a task I D back immediately, and polls for results later.
You're not blocking the main agent while the research runs. That's a real production pattern.
The protocol endpoints are standard — create thread, create run, poll run status, fetch thread state, cancel run. Any agent that speaks Agent Protocol can be plugged in as an async sub-agent. The example even shows how to swap in your own agent: just replace the create underscore deep underscore agent call in server dot py.
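For flavor, a supervisor-side polling loop against those endpoints might look roughly like this; the paths and payload fields are assumptions about the Agent Protocol flow, not a copy of the example's code.

```python
# Illustrative client for the create-thread / create-run / poll flow described
# above. Endpoint paths and payload fields are assumptions, not the exact spec.
import time
import requests

BASE = "http://localhost:2024"  # the hosted research sub-agent

thread = requests.post(f"{BASE}/threads", json={}).json()

run = requests.post(
    f"{BASE}/threads/{thread['thread_id']}/runs",
    json={"input": {"messages": [{"role": "user", "content": "Research topic X"}]}},
).json()

# The supervisor gets a run id back immediately and checks in later.
while True:
    status = requests.get(
        f"{BASE}/threads/{thread['thread_id']}/runs/{run['run_id']}"
    ).json()
    if status["status"] in ("success", "error"):
        break
    time.sleep(5)

state = requests.get(f"{BASE}/threads/{thread['thread_id']}/state").json()
```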
Let's talk about the C L I, because that's the part most people will actually interact with. It's a terminal-based coding agent, similar to Claude Code or Cursor, but it's powered by any L L M — not tied to one provider.
The install is a one-liner curl command. You run it and you get a rich Textual-based terminal interface with streaming responses, web search, headless mode for scripting and C I, remote sandboxes, persistent memory, custom skills, and human-in-the-loop approval. It's a full product.
The C L I ships with a deploy command. You can take an agent configuration, bundle it up, and deploy it as a web application. The repo includes several deploy examples — a coding agent, a content writer with per-user memory and Supabase auth, a docs research agent that uses M C P tools, and a G T M strategy agent that coordinates sync and async sub-agents.
The deploy examples are where you see the "batteries included" philosophy really shine. Each one has a deep agents dot T O M L config file, an agents dot M D file for persistent memory, and a skills directory. You configure what the agent knows and what it can do, and the framework handles the rest.
Let's talk about skills, because that's a core abstraction in this project. A skill is a directory containing a S K I L L dot M D file that describes a capability, and optionally scripts or other resources. The agent loads skills at startup, and they become part of its system prompt.
The examples directory has a great illustration of this. The content builder agent has skills for blog posts and social media. The deploy coding agent has skills for code review, coding preferences, and planning. The Nvidia deep agent has skills for C U D F analytics, C U M L machine learning, data visualization, and G P U document processing.
Skills are how you teach the agent domain-specific knowledge without having to modify the framework. You just drop in a skill directory and the agent knows how to do the thing.
Skills are portable. The downloading agents example in the repo shows that agents are literally just folders — you can zip one up, download it, unzip it, and run it. The agent is the configuration.
That's a subtle but important point. In most frameworks, the agent is code you write. In deep agents, the agent is configuration plus skills — declarative, not imperative.
The memory system is worth digging into as well. There's an agents dot M D file that persists across sessions. The C L I can cache agent memory in GitHub Actions across workflow runs, keyed by agent name and memory scope — per P R, per branch, or per repo.
If you're running a coding agent in C I, it can remember what it did last time and build on that. The action dot Y M L in the repo root defines the entire GitHub Actions integration — you just add a step that calls deep agents with a prompt, and it handles model selection, A P I keys, skills installation, memory caching, and timeout management.
The action is a composite action, so it's not a Docker container — it runs directly in the workflow. It installs the C L I with U V X, optionally clones a skills repo, restores memory from cache, runs the agent, and saves memory back to cache. All configurable through standard action inputs.
I want to talk about the evaluation framework, because that's where you see whether this is a serious engineering project or a demo. The evals directory is substantial — it's a full evaluation suite with its own test infrastructure.
The eval catalog covers file operations, follow-up quality, human-in-the-loop, memory, multi-turn memory, skills, sub-agents, summarization, system prompts, todos, tool selection, and tool usage. They've got external benchmarks too — B F C L, Frames, Nexus. And there's a Harbor integration for running evals against sandboxed agents.
The memory agent bench and the tau two airline benchmark are particularly interesting. The airline benchmark simulates a customer service agent handling flight bookings against a policy document and a database. The agent has to follow business rules while being helpful — exactly the kind of real-world task where most agents fall apart.
The eval infrastructure generates radar charts and model comparison reports. There are scripts for analyzing failures, generating the eval catalog, and running Harbor-based benchmarks. This is not a "we ran it a few times and it seemed fine" situation. This is systematic evaluation.
The partners directory is another signal of maturity. They've got integrations with Daytona, Modal, Runloop, and a Quick J S R E P L. Each partner package is independently versioned with its own changelog and release cycle.
The Quick J S integration is especially interesting. It gives the agent a JavaScript R E P L inside its sandbox, with foreign function interfaces so the agent can call Python tools from JavaScript and vice versa. The R E P L swarm example shows a skill module that dispatches sub-agents in parallel from inside the Quick J S R E P L.
You've got an agent writing JavaScript that spawns sub-agents. The recursion possibilities are either exciting or terrifying, depending on your disposition.
The R L M agent example takes this even further. It wraps create underscore deep underscore agent with a recursive R E P L plus parallel task-chain sub-agents, so you get fan-out across multiple levels. A top-level agent spawns sub-agents, and those sub-agents can spawn their own sub-agents.
Which is powerful, but I'd want to see the token costs on that before deploying it to production.
But the architecture is what matters — it shows that the sub-agent system is composable. You can nest agents arbitrarily, and the context isolation means each level only sees what it needs.
Let's talk about model support. The repo is aggressively provider-agnostic. The S D K works with any L L M that supports tool calling. The C L I auto-detects providers from model name prefixes — Claude, G P T, Gemini — and falls back to checking which A P I keys are set.
The harness profiles in the S D K are worth looking at. There are pre-configured profiles for Anthropic's Haiku four point five, Opus four point seven, and Sonnet four point six, plus OpenAI Codex. These profiles encode things like token limits, tool-calling behavior, and recommended settings for each model.
The provider profiles handle the differences between OpenAI, OpenRouter, and Anthropic A P I formats. The framework abstracts that away so you can swap models without changing your agent code.
The configuration system is flexible too. You can configure models through a deep agents dot T O M L file, through environment variables, or programmatically. The C L I has a model selector widget in the terminal interface, so you can switch models mid-session.
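Programmatically, swapping models is meant to be a one-argument change; whether the S D K takes a plain model string or a LangChain chat model object here is an assumption, so read this as a sketch.

```python
# Sketch of programmatic model selection. Passing a LangChain chat model
# object here is an assumption about the SDK's model parameter; the agent
# code itself does not change when you swap providers.
from deepagents import create_deep_agent
from langchain_anthropic import ChatAnthropic

agent = create_deep_agent(
    model=ChatAnthropic(model="claude-sonnet-4-5"),  # any tool-calling model works
    system_prompt="You are a coding assistant.",
)
```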
I want to highlight something in the contributing guide — the A G E N T S dot M D file at the repo root. It's a development guide for anyone working on the project, and it's thorough. Conventional commits with required scopes, pre-commit hooks for formatting and linting, a release-please pipeline that auto-publishes to PyPI, and detailed instructions for adding new model providers or partner integrations.
The P R labeling system is automated — it classifies pull requests by size, file changes, title format, and contributor tier. There's a labeler config that maps scopes to labels and file paths to scopes. The C I pipeline runs unit tests, integration tests, benchmarks, and linting across all packages.
The release process is fully automated. When a conventional commit lands on main, release-please creates a release P R with version bumps and changelog entries. Merging that P R triggers the release pipeline — build, test against the built package, publish to Test PyPI, publish to PyPI via trusted publishing, create a GitHub release.
The security model is documented honestly. The threat model file in the C L I says the agent can do anything its tools allow. The shell allow list is the primary boundary. There's a dangerous patterns test that checks for things like eval, exec, and pickle on user input.
The unicode security module in the C L I is a nice touch — it checks for homoglyph attacks and other unicode-based injection vectors. Someone thought about the fact that L L M output is untrusted input from a security perspective.
Let's spend some time on the examples directory, because that's where you see what people are actually building with this. The deep research agent is a multi-step web research system. It uses Tavily for U R L discovery, spawns parallel sub-agents to research different angles, and then does a strategic reflection step to synthesize findings.
It's not just "search and summarize." It's "search, fan out to parallel researchers, reflect on what you found, identify gaps, search again." That's a real research workflow.
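The setup behind that kind of agent is close to the readme's own research example; a hedged sketch, assuming the tavily-python client and a Tavily A P I key in the environment.

```python
# Sketch of the "search, fan out, reflect" setup: a web search tool handed to
# a deep agent. Assumes the tavily-python client and a Tavily API key; the
# real example adds parallel research sub-agents and a reflection step.
import os
from tavily import TavilyClient
from deepagents import create_deep_agent

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def internet_search(query: str) -> dict:
    """Run a web search and return raw results for the agent to sift through."""
    return tavily.search(query, max_results=5)

agent = create_deep_agent(
    tools=[internet_search],
    system_prompt="Plan the research, delegate angles to sub-agents, then synthesize.",
)
```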
The text-to-S Q L agent is a natural language to S Q L system with planning and skill-based workflows. It has a schema exploration skill that lets the agent discover the database structure, and a query writing skill with best practices for S Q L generation.
The Nvidia deep agent example is interesting because it shows multi-model orchestration. It uses Nvidia's Nemotron Super for research tasks and then switches to G P U-accelerated code execution via R A P I D S for data processing. The skills include C U D F analytics, C U M L machine learning, and G P U document processing.
The Ralph mode example is an autonomous looping pattern. The agent runs in a loop with fresh context each iteration, using the filesystem for persistence between loops. It's like giving the agent a notebook it can read from and write to across multiple "sessions."
The better harness example is maybe the most meta thing in the repo. It's an eval-driven outer-loop optimization of a deep agents harness. So you're using an agent to improve the harness that another agent runs on. It's agents all the way down.
Which brings us to the philosophical question at the heart of this project. The readme acknowledges that deep agents was primarily inspired by Claude Code — it started as an attempt to understand what made Claude Code general-purpose, and then make it even more so. But it's not a clone. It's a generalization.
The key difference is provider agnosticism. Claude Code is tied to Anthropic's models. Deep agents works with anything that can do tool calling. That's a meaningful architectural choice — it says the agent pattern is bigger than any one model provider.
It's M I T licensed. Fully open source. No strings attached. You can fork it, modify it, deploy it, sell it. The only requirement is keeping the copyright notice.
I think the most underappreciated aspect of this project is the C L I's attention to detail. The Textual interface has a model selector, a theme selector, a notification center, a thread picker, a sub-agent activity panel, a todos panel, file panels, diff rendering, tool call cards with specialized renderers for file operations, search results, and thinking steps.
The C L I also has a non-interactive mode for scripting. You can pipe a prompt to it and get the response back. That's how the GitHub Action works — it runs the agent headless with a prompt and captures the output.
The M C P support means the agent can connect to external tools through the Model Context Protocol. The deploy M C P docs agent example shows an agent that uses M C P tools to search LangChain documentation. The M C P configuration is in a dot M C P dot json file — you just point the agent at an M C P server and it discovers the available tools.
The context management system deserves more attention. When conversations get long, the agent auto-summarizes to stay within token limits. Large outputs from tools get saved to files instead of being stuffed into the context window. The agent learns to use its own filesystem as external memory.
The sub-agent system compounds this. Each sub-agent gets a fresh context window. The main agent only sees the final result, not the entire chain of thought. So you can have a sub-agent do a hundred steps of reasoning, and the main agent just gets the conclusion.
The middleware architecture is what makes all of this composable. The S D K has middleware for filesystem access, memory, skills, sub-agents, async sub-agents, tool call patching, summarization, and tool exclusion. You can mix and match these to build exactly the agent you want.
The backend system abstracts where the agent runs. There are backends for local filesystem, state-based storage, LangSmith sandboxes, composite backends that combine multiple storage layers, and a store backend for key-value persistence. The agent doesn't know or care where its files are — it just calls read file and write file.
I want to mention the async sub-agent server example in more detail, because it's a pattern I think we'll see a lot of. You run a Fast A P I server that wraps a deep agent. The server implements the Agent Protocol — it creates threads, runs agents, and returns results. Then any other deep agent can connect to it as an async sub-agent.
You could have a fleet of specialized agents running on different infrastructure — a research agent on one server, a coding agent on another, a data analysis agent on a third — and a supervisor agent that coordinates them all. The supervisor doesn't need to know how they work. It just needs their U R L and a description of what they do.
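A drastically cut-down sketch of the server side of that pattern, assuming FastAPI; the real example adds a database layer, background execution, and cancellation, and the endpoint shapes here are assumptions.

```python
# Heavily simplified sketch of hosting a deep agent behind Agent Protocol
# style endpoints. The real example runs work in the background and persists
# runs in a database; the paths, fields, and synchronous execution here are
# simplifying assumptions.
import uuid
from fastapi import FastAPI
from deepagents import create_deep_agent

app = FastAPI()
agent = create_deep_agent(system_prompt="You are a research sub-agent.")
threads: dict[str, dict] = {}

@app.post("/threads")
def create_thread() -> dict:
    thread_id = str(uuid.uuid4())
    threads[thread_id] = {"status": "idle", "result": None}
    return {"thread_id": thread_id}

@app.post("/threads/{thread_id}/runs")
def create_run(thread_id: str, payload: dict) -> dict:
    # A production server would return immediately and run this in the background.
    result = agent.invoke(payload["input"])
    threads[thread_id] = {"status": "success", "result": result["messages"][-1].content}
    return {"run_id": str(uuid.uuid4()), "status": "success"}

@app.get("/threads/{thread_id}/state")
def get_state(thread_id: str) -> dict:
    return threads[thread_id]
```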
The supervisor example in the repo shows exactly this pattern. It connects to a researcher at localhost port twenty twenty-four, delegates research tasks, and polls for results. The supervisor's system prompt tells it when to launch the researcher, when to check status, when to update tasks, and when to cancel them.
What I find striking is how little code this requires. The server is about three hundred lines of Python, including the database layer. The supervisor is about a hundred and fifty lines. The framework does the heavy lifting.
That's the promise of "batteries included." You're not writing agent infrastructure. You're configuring agent behavior.
Let's talk about what's not in the repo. The digest doesn't show any built-in vector store integration. There's no mention of R A G in the readme. The memory system is file-based and session-based, not embedding-based.
That's a deliberate choice, I think. The philosophy is that the filesystem is the universal memory interface. You can implement R A G on top of it — just write documents to files and use grep or glob to find them — but the framework doesn't prescribe a specific retrieval architecture.
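A toy illustration of that idea, in plain Python rather than the S D K's own tools: write notes to files, then grep them back out.

```python
# Toy file-based retrieval of the kind the filesystem tools enable: write
# notes to disk, then grep for a keyword. Plain Python, not the SDK's API.
from pathlib import Path

notes = Path("notes")
notes.mkdir(exist_ok=True)
(notes / "langgraph.md").write_text("LangGraph provides checkpointing and streaming.")
(notes / "skills.md").write_text("A skill is a directory with a SKILL.md file.")

def grep(keyword: str) -> list[str]:
    """Return every note line containing the keyword, prefixed with its file name."""
    hits = []
    for path in notes.glob("*.md"):
        for line in path.read_text().splitlines():
            if keyword.lower() in line.lower():
                hits.append(f"{path.name}: {line}")
    return hits

print(grep("checkpointing"))
```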
It also means the agent works without any infrastructure beyond a Python environment. No vector database, no embedding service, no external dependencies except the L L M A P I. That makes it trivial to deploy and test.
The trade-off is that for very large knowledge bases, file-based retrieval won't scale the way a vector store would. But that's a problem you can solve by adding a tool. The framework doesn't stop you from connecting to Pinecone or Weaviate — it just doesn't force you to.
I think the most honest thing in the whole repo is the A G E N T S dot M D file's section on security: "Deep Agents follows a trust the L L M model. The agent can do anything its tools allow." That's not a disclaimer — it's a design principle. The agent is not constrained by prompts. It's constrained by what it can actually do.
The corollary is that if you give an agent shell access without sandboxing, it can do anything you can do from a terminal. The framework makes this clear and gives you the tools to sandbox it properly. Whether you use them is up to you.
The C L I's threat model document goes into more detail. It covers the attack surface — L L M output is untrusted, tool outputs are untrusted, user input is untrusted. The allow list system, the unicode security checks, the dangerous patterns detection — these are all layers of defense.
I want to circle back to something we mentioned earlier — the fact that deep agents started as an attempt to understand Claude Code. There's a humility in that. The readme doesn't say "we built a better Claude Code." It says "we tried to understand what made it work and generalize it."
The result is something that's both more flexible and more opinionated. More flexible because it works with any model. More opinionated because it makes choices about how agents should work — they should plan with todos, use files for memory, delegate to sub-agents, and summarize when context gets long.
The examples directory shows the range. You've got research agents, coding agents, content writers, S Q L generators, G T M strategists, Nvidia G P U agents, recursive R E P L agents, swarm agents, looping agents. All built on the same primitives.
The downloading agents example makes the point that agents are just folders. You zip one up, share it, and someone else can run it with a single command. That's a different model from "clone my repo and install these seventeen dependencies."
The C L I's deploy command takes that further. You configure an agent, run deep agents deploy, and you get a hosted web application with auth, persistence, and a front end. The deploy examples include a content writer with per-user memory stored in Supabase.
The front end is built in React with TypeScript, bundled with Vite, and included in the C L I package. When you deploy an agent, the front end is served from the same process. It's a complete product.
I think the evaluation framework is the part that gives me the most confidence in this project. It's not just that they have tests — it's that they have systematic evaluations across multiple dimensions. Memory, tool selection, sub-agent coordination, summarization quality, human-in-the-loop workflows. These are the things that actually matter for agent reliability.
The eval infrastructure is designed to be run against different models. The model groups configuration lets you compare how Claude Sonnet performs versus G P T four O versus Gemini on the same tasks. The radar charts make it easy to see where each model excels and where it struggles.
The Harbor integration means you can run evals in sandboxed environments that match production. The agent runs in a container, the eval checks its behavior, and you get a report. That's real C I for A I agents.
One thing I notice is that the repo is actively maintained. The release manifest shows the S D K at version zero point five point four, the C L I at zero point zero point forty-four, the A C P package at zero point zero point six. These are moving targets, not abandonware.
The contributor guide is written for people who aren't already on the team. It explains the monorepo structure, the development tools, the commit conventions, the P R process, the release pipeline. It's an invitation to contribute.
Which matters for an open source project. If the only people who can contribute are the ones who already understand the codebase, the project doesn't grow.
The pre-commit hooks are worth mentioning because they enforce quality at commit time. Format and lint for each package, lock file consistency checks, version equality checks between pyproject dot T O M L and the underscore version dot py files. You can't accidentally commit something that breaks the build.
The C I pipeline runs on every P R. Linting, type checking, unit tests, integration tests, benchmarks. The integration tests spin up actual sandboxes and run agents against them. The benchmarks track startup performance and import times.
The release process is fully automated from commit to PyPI. Conventional commits trigger release-please, which creates a release P R with version bumps. Merging the P R publishes to PyPI and creates a GitHub release. No manual steps, no forgotten version bumps.
I think if there's one thing I'd want to see more of, it's documentation on failure modes. The eval framework tests for them, but there's no "here's what goes wrong and how to fix it" guide. Every agent framework has sharp edges, and knowing where they are before you cut yourself is valuable.
That's fair. The threat model document covers security failure modes, but operational failure modes — what happens when a sub-agent gets stuck in a loop, how to recover from a summarization that lost important context, when to increase the context window versus spawn a sub-agent — those aren't documented yet.
Though to be fair, some of that is inherent to working with L L Ms. No framework can prevent a model from hallucinating or getting confused. The best you can do is give it tools to recover.
Deep agents does that better than most. The todos system helps the agent notice when it's off track. The filesystem gives it a way to checkpoint its work. The sub-agent system lets it isolate failures. It's defense in depth for agent reliability.
To summarize what we've walked through: deep agents is a monorepo containing an agent S D K, a terminal-based coding agent C L I, an Agent Context Protocol package, an evaluation suite, and multiple partner integrations. It's batteries-included, provider-agnostic, M I T licensed, and built on LangGraph. The core primitives are planning, filesystem access, shell access with sandboxing, sub-agents with context isolation, and automatic context management. The C L I is a full product with a Textual terminal interface, deploy commands, GitHub Actions integration, and persistent memory. The examples show research agents, coding agents, content writers, S Q L generators, and multi-model orchestrations. The evaluation framework systematically tests memory, tool use, sub-agent coordination, and summarization across multiple models.
The philosophy is "trust the L L M, constrain the tools." Don't ask the model to behave — give it a sandbox where misbehavior is contained. It's a practical, engineering-first approach to a problem that a lot of people are still approaching as a prompt engineering challenge.
One open question I have is how this evolves as models get better at long-context reasoning. If context windows keep growing, does the sub-agent pattern become less necessary? Or does it remain useful for organizational reasons — keeping concerns separated even when you don't have to?
I think sub-agents remain useful regardless of context window size. It's not just about fitting everything in context — it's about cognitive organization. A single agent trying to do research, write code, review output, and manage a deployment is going to get confused no matter how big its context window is. Delegation is a software architecture pattern, not a workaround for token limits.
We've covered a lot of ground. The repo is at github dot com slash langchain hyphen A I slash deep agents, and it's worth exploring if you're building anything with agents.
And now: Hilbert's daily fun fact.
Hilbert: The national animal of Scotland is the unicorn. It has been since the twelfth century, when it was adopted as a symbol of purity and power by William the First.
...right.
Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. Find us at myweirdprompts dot com.
See you next time.