Here's what Daniel sent us. He wants us to cover the essential knowledge areas for mastering agentic AI — the programming languages, the frameworks, the ancillary skills, and the key concepts that serious practitioners actually need to know. Not the surface-level stuff, not "what is an AI agent" — the real technical foundations. So that's what we're doing today.
Herman Poppleberry, by the way, for anyone who's new. And yeah, this is a meaty one. I've been sitting on a lot of research for this topic, and the honest answer is there's no single entry point — it's more like a constellation of skills that all have to come together at once.
Which is exactly the kind of thing that makes people either very excited or immediately close the tab.
Fair. So let's start with the foundation that nobody argues about: Python. If you're doing anything serious in agentic AI (training models, fine-tuning, implementing research, building data pipelines), Python is non-negotiable. Every major framework (LangGraph, CrewAI, AutoGen, LlamaIndex) is Python-first. And the ML ecosystem that sits underneath all of this (PyTorch, Hugging Face Transformers, scikit-learn) has no peer in any other language.
But Python isn't just "know Python." There are specific Python skills that matter for agent work in particular.
This is the part people miss. For agentic systems, the critical Python skills are async programming with asyncio, FastAPI for building tool-serving APIs and MCP servers, Pydantic for structured tool schemas and output validation, and solid type hints. The reason async matters so much is that agents spawn parallel tasks and make simultaneous API calls. Without async, you're bottlenecked waiting for sequential responses, and in a multi-step agent workflow, that latency compounds badly.
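That latency point can be made concrete with a framework-free sketch. Everything here is invented for illustration (the tool names, the delays); the takeaway is the shape of the asyncio.gather call, where total wait time is roughly the slowest call instead of the sum of all calls.

```python
import asyncio

async def call_tool(name: str, delay: float) -> str:
    # Stand-in for a network call to an external tool or LLM API.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def sequential() -> list[str]:
    # Each call waits for the previous one: total latency is the sum.
    return [await call_tool("search", 0.1), await call_tool("db", 0.1)]

async def parallel() -> list[str]:
    # asyncio.gather runs the coroutines concurrently:
    # total latency is roughly the slowest call, not the sum.
    return await asyncio.gather(
        call_tool("search", 0.1),
        call_tool("db", 0.1),
    )

results = asyncio.run(parallel())
print(results)
```

In a multi-step workflow those savings compound: every planning step that fans out to three tools pays for the slowest tool once, rather than three times.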
And then there's the TypeScript conversation, which I know you have opinions about.
TypeScript overtook Python in GitHub's twenty twenty-five language report (that's overall usage, not AI-specific). It's still a significant data point. The Vercel AI SDK gives you a unified interface for OpenAI, Anthropic, and Google with streaming, tool calling, and React hooks built in. The Claude Agent SDK ships Python and TypeScript libraries. LangGraph supports both. For anyone building AI-powered products and web-deployed agents, TypeScript is increasingly the pragmatic choice.
So it's not really a competition. It's more like Python owns the model layer and TypeScript owns the product layer.
The honest framing from practitioners is: Python dominates ML model training, research, and data science. TypeScript leads in deploying AI to web applications and building AI-powered products. Many professional systems use Python for training and TypeScript for deployment. If you only learn one, learn Python. If you want to build full-stack AI products, you need both.
By the way, today's script is powered by Claude Sonnet four point six. Just worth noting given we're about to spend twenty-five minutes talking about AI frameworks.
The irony is not lost on us. Right, so frameworks. This is where things get genuinely complicated because the landscape has shifted significantly and the choices have real consequences for production systems.
Let's start with LangGraph because that seems to be where the serious production deployments are clustering.
LangGraph models agent workflows as directed graphs — state machines where nodes are processing functions and edges define state transitions. The key insight is that this handles cycles naturally. An agent that needs to retry, gather more information, or loop through a planning phase is just a graph with cycles. That's a fundamentally better mental model than a linear chain for anything non-trivial.
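A minimal, framework-free sketch of that mental model. This is not the LangGraph API, just the graph-with-cycles idea; the node names and the retry condition are invented for illustration.

```python
def plan(state):
    # Gather more information; in a real agent this would be a model call.
    state["attempts"] += 1
    return state

def act(state):
    # Succeeds only once enough information has been gathered.
    state["done"] = state["attempts"] >= 3
    return state

def after_act(state):
    # Conditional edge: loop back to "plan" until the task is done.
    return "end" if state["done"] else "plan"

# Nodes are functions over shared state; edges pick the next node.
# "plan" always flows to "act"; "act" may cycle back. The cycle is the point.
NODES = {"plan": (plan, lambda s: "act"), "act": (act, after_act)}

def run_graph(state, entry="plan"):
    node = entry
    while node != "end":
        fn, edge = NODES[node]
        state = fn(state)
        node = edge(state)
    return state

final = run_graph({"attempts": 0, "done": False})
print(final)  # the plan/act cycle ran three times before the exit edge fired
```

Expressing the retry as an edge back into the graph, rather than a while-loop buried inside one function, is what makes checkpointing and human-in-the-loop interrupts possible at any node boundary.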
Who's actually running this in production?
Klarna, Cisco, Vizient. It reached version one point zero in late twenty twenty-five and is now the default runtime for all LangChain agents. It delivers forty to fifty percent LLM call savings on repeat requests through stateful patterns. It has built-in persistence with checkpointing, streaming, and human-in-the-loop support. Twenty-five thousand GitHub stars. The weakness is the learning curve — the state graph mental model takes real time to internalize. And the documentation changes frequently enough that tutorials from three months ago may not work.
That last point is worth sitting with. Seventy percent of regulated enterprises rebuild their agent stack every three months. That's from a Cleanlab survey of over eighteen hundred engineering leaders. Which suggests the documentation problem isn't just a LangGraph issue — it's the whole ecosystem.
It's the whole ecosystem. And it's one of the most underappreciated risks in agentic development right now. The practical implication is: keep your core logic portable. Your prompts, your tools, your evaluation harnesses — these should not be tightly coupled to framework-specific patterns, because you may need to swap the framework underneath.
Okay, CrewAI. The one that everyone seems to prototype with first.
CrewAI models agents as a team of specialists — a "crew" with roles, goals, and backstories. You define agents with roles like "Senior Research Analyst," assign their tasks, and let the framework handle coordination. The fastest prototype I've seen documented is two to four hours from setup to working multi-agent demo. IBM, PwC, and Gelato run it in enterprise production. It has over a hundred thousand certified developers in its community.
Two to four hours is remarkable. What's the catch?
Token cost. A crew of four agents can use three to five times more tokens than a single agent. There are documented delays on the enterprise platform — something like twenty-minute pending run delays in some configurations. And for complex conditional logic, it gives you less control than LangGraph. The other thing worth knowing is that CrewAI has two modes: Crews, which are autonomous teams with true agency, and Flows, which are event-driven pipelines for production predictability. Flows are the more mature production pattern.
AutoGen is interesting because Microsoft built it and then... kind of walked away from it?
That's the accurate read. AutoGen treats multi-agent work as structured dialogue — agents participate in group chats with defined speaking orders. The version zero point four release in late twenty twenty-five was a major rewrite introducing async event-driven architecture, and AutoGen Studio added a no-code visual interface. The standout capability is code execution: agents can write Python, execute it in a Docker sandbox, observe results, and iterate. For coding tasks and data analysis, it's the best in class.
But?
Microsoft shifted strategic focus to the broader Microsoft Agent Framework, which is merging AutoGen with Semantic Kernel. AutoGen is now effectively in maintenance mode — bug fixes and security patches, no major new features. Fifty thousand GitHub stars but a fragmented ecosystem because the zero point four breaking changes split the community. If you're in a Microsoft-heavy enterprise environment, it still makes sense. Otherwise, LangGraph or CrewAI are safer long-term bets.
What about LlamaIndex? Because that one occupies a different niche entirely.
LlamaIndex is the RAG specialist. Where LangGraph handles workflow orchestration, LlamaIndex focuses on data connectivity and retrieval. Advanced indexing strategies, an extensive data connector ecosystem, and it outperforms general frameworks for retrieval-heavy use cases. If you're building document Q&A systems, knowledge bases, semantic search, or any agent that needs to reason over large document collections, LlamaIndex is the right tool. It's not trying to compete with LangGraph on orchestration — it's the best at what it does.
And the Claude Agent SDK is the newest serious entrant.
It gives developers the same infrastructure that powers Claude Code, packaged as Python and TypeScript libraries. Agents can read and edit files, run shell commands, search the web, call external tools through MCP servers, all in a sandboxed environment. The differentiator is built-in sandboxed execution and native MCP support. Setup is genuinely fast — install the package, provide an API key, and you're running. The weakness is model lock-in: it only works with Claude models.
Which brings us to MCP, because that's the connective tissue underneath a lot of this.
MCP — Model Context Protocol — was developed by Anthropic and is now governed by the Linux Foundation's Agentic AI Foundation, with backing from Anthropic, OpenAI, Google, Microsoft, AWS, Block, Cloudflare, and Bloomberg. It connects a single agent to external tools, APIs, and data sources. The flow is: user asks, agent determines it needs external information, MCP server checks permissions, returns the result, agent responds. It's vertical integration — extending what a single agent can do.
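A toy version of that flow. This is not the real MCP wire protocol (which is JSON-RPC based); the tool table and permission set are invented to show the sequence: request comes in, permissions are checked server-side, and only then does the tool execute.

```python
# Invented tool registry and permission grant for illustration.
TOOLS = {"weather": lambda city: f"Sunny in {city}"}
ALLOWED = {"weather"}  # permissions granted to this agent

def mcp_style_call(tool: str, arg: str) -> str:
    if tool not in ALLOWED:
        return "error: permission denied"  # server-side permission check
    if tool not in TOOLS:
        return "error: unknown tool"
    return TOOLS[tool](arg)  # execute and return the result to the agent

print(mcp_style_call("weather", "Oslo"))     # Sunny in Oslo
print(mcp_style_call("payments", "refund"))  # error: permission denied
```

The important design point survives the simplification: the permission check lives on the server side of the boundary, not in the agent's prompt.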
And A2A is the horizontal layer.
A2A — Agent-to-Agent protocol — was launched by Google with fifty-plus technology partners. It enables agent-to-agent communication. Agents publish what are called Agent Cards, which are JSON self-descriptions of their capabilities, so other agents can discover and essentially hire them. It supports parallel task execution, progress sharing, and dynamic collaboration. The relationship between MCP and A2A is complementary, not competitive: each agent in an A2A network might use MCP to call its own tools. MCP extends what a single agent can do; A2A expands how agents collaborate.
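The Agent Card idea reduces to something like this sketch. The field names and endpoints are illustrative, not the exact A2A schema; the point is that capability discovery is just filtering over published self-descriptions.

```python
import json

# Invented Agent Cards: JSON self-descriptions other agents can search.
CARDS = [
    {"name": "summarizer", "skills": ["summarize"], "endpoint": "https://example.com/a"},
    {"name": "translator", "skills": ["translate"], "endpoint": "https://example.com/b"},
]

def discover(skill: str) -> list[str]:
    # Return the endpoint of every agent advertising the needed skill.
    return [c["endpoint"] for c in CARDS if skill in c["skills"]]

print(json.dumps(CARDS[0]))   # what a published card looks like on the wire
print(discover("translate"))  # ['https://example.com/b']
```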
There's also ACP from IBM, and apparently AG-UI for agent-to-UI communication.
Right. Three protocols racing to become the HTTP of AI agents. The winner, or winners, will define the architecture of what people are calling the agent economy — a world where specialized agents from different vendors discover and hire each other dynamically. My read is that MCP and A2A are the serious contenders, and they're designed to coexist. ACP is IBM's enterprise play and may matter a lot in regulated industries.
Okay, let's talk about the core concepts that underpin all of this, because you can know the frameworks without understanding what they're actually doing.
The four components that every agentic system is built on, regardless of framework. First, the reasoning engine — the LLM that processes inputs, makes decisions, and plans multi-step actions. Second, tool calling, which is how agents interact with external systems — MCP has emerged as the standard interface. Third, memory systems, which split into short-term working memory for the current session and long-term persistent memory stored in vector databases for cross-session continuity. Fourth, orchestration and planning, which is where frameworks differ most.
And the core loop that ties these together is ReAct.
ReAct — Reasoning plus Acting — is the foundational pattern. The agent reasons about what action to take, takes that action, observes the result, reasons about what to do next, and repeats until the task is complete. Stanford's Human-Centered AI Group found that nearly seventy percent of multi-step tasks fail when planning mechanisms are missing. That number should be alarming to anyone shipping agentic systems without properly implementing the reasoning loop.
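The loop itself is simple enough to sketch. Here a scripted function stands in for the model; the reason-act-observe structure is the point, and the facts and tool behavior are invented for the example.

```python
def fake_reason(observations):
    # Stand-in for an LLM deciding the next action from observations so far.
    if "population: 5.5M" in observations:
        return ("finish", "Norway has about 5.5 million people")
    return ("lookup", "population of Norway")

def act(action, arg):
    if action == "lookup":
        return "population: 5.5M"  # stand-in for a real tool call
    return arg

def react(max_steps=5):
    observations = []
    for _ in range(max_steps):        # reason -> act -> observe, repeated
        action, arg = fake_reason(observations)
        result = act(action, arg)
        if action == "finish":
            return result
        observations.append(result)   # observe, then reason again
    return "gave up"                  # step budget is the safety valve

print(react())
```

Note the max_steps budget: without it, a confused model can loop forever, which is the degenerate version of the planning failures in that Stanford number.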
Beyond basic ReAct, there are more sophisticated reasoning architectures. Chain-of-Thought I think most people know, but Tree-of-Thought is less understood.
Tree-of-Thought explores multiple reasoning paths simultaneously before choosing one. It's computationally more expensive but significantly more robust for tasks with genuine ambiguity. Reflexion is the one I find most interesting architecturally — the agent reviews its own reasoning for errors before producing a final output and learns from past mistakes. It's self-correction built into the loop. ReWOO separates planning from execution entirely for efficiency. These aren't just academic patterns — they're the difference between agents that recover from errors and agents that confidently cascade into failure.
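Reflexion in miniature: draft, critique your own draft, revise before returning. The draft and critique functions here are scripted stand-ins for model calls; only the loop structure is the claim.

```python
def draft(task, feedback):
    # Stand-in for a model call; revises if critique feedback exists.
    base = f"answer to {task}"
    return base + " (revised)" if feedback else base

def critique(answer):
    # Self-review: return an error note, or None if the answer passes.
    return None if "revised" in answer else "missing detail"

def reflexion(task, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        answer = draft(task, feedback)
        feedback = critique(answer)
        if feedback is None:
            return answer  # self-review passed, stop early
    return answer          # best effort after the round budget

print(reflexion("summarize report"))
```

The error-recovery property comes from routing the critique back into the next draft, rather than emitting the first answer unchecked.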
There's a disruption happening to multi-agent architectures specifically from reasoning models. The o3 and o4-class models from OpenAI, DeepSeek R1, Gemini Thinking — they're doing deeper reasoning at inference time, which reduces the need for complex multi-agent loops.
This is a genuine architectural shift. A single reasoning model can now handle tasks that previously required a researcher agent, an analyst agent, and a writer agent working in sequence. Teams that built elaborate multi-agent systems in twenty twenty-four may be over-engineering in twenty twenty-six. The optimal architecture is moving toward fewer, more capable agents rather than larger crews of simpler ones. Which connects to what practitioners keep saying — eighty percent of real-world use cases are handled by a single agent with good tools and a clear system prompt.
That eighty percent rule deserves more attention than it gets. The temptation to build a five-agent crew on day one is real.
And almost always wrong. Multi-agent systems add cost, complexity, and unpredictability. When you do need them, the orchestration patterns matter: hierarchical manager-worker setups where a manager delegates to specialists, sequential pipelines where each agent builds on the previous output, swarm patterns where agents collaborate as equals, and parallel execution where multiple agents work simultaneously on different subtasks. Each pattern has a different failure mode, and you need to know which one matches your problem before you build.
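One of those patterns in miniature: a sequential pipeline where each stage builds on the previous output. The stages here are plain functions standing in for model-backed agents; the names are invented.

```python
def researcher(topic):
    return f"notes on {topic}"

def analyst(notes):
    return f"analysis of ({notes})"

def writer(analysis):
    return f"report: {analysis}"

def pipeline(topic, stages=(researcher, analyst, writer)):
    out = topic
    for stage in stages:
        out = stage(out)  # each stage consumes the previous stage's output
    return out

print(pipeline("battery supply chains"))
```

The characteristic failure mode is also visible in the shape: an error in the first stage flows downstream untouched, which is why sequential pipelines need validation between stages.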
Let's get into ancillary skills, because this is where I think a lot of practitioners have gaps. Prompt engineering first.
In agentic systems, prompts are executable specifications, not search queries. A vague prompt doesn't just return a bad answer — it cascades into flawed planning and incorrect execution. The rule of thumb from practitioners is: spend eighty percent of your time on prompt engineering and twenty percent on framework selection. The framework matters less than the prompts. Key techniques are few-shot prompting to give the agent reasoning patterns to follow, chain-of-thought prompting to force step-by-step reasoning, role-based prompting to align behavior, and system prompt design that defines agent roles, goals, constraints, and escalation paths. Guardrails — output validation, content filtering, safety constraints — are part of prompt engineering, not an afterthought.
Vector databases. This is the memory backbone.
The options break down by use case. Pinecone is fully managed and production-scale but cloud-only. Weaviate is strong for hybrid search combining semantic and keyword retrieval. pgvector is the choice for teams already on PostgreSQL — and there's a striking benchmark from Timescale showing PostgreSQL with pgvector and pgvectorscale achieves twenty-eight times lower p95 latency and sixteen times higher query throughput compared to Pinecone on fifty million Cohere embeddings. Chroma is the developer experience choice for prototyping. Qdrant is Rust-based and high-performance. Milvus and Zilliz for billion-scale enterprise workloads.
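Under the hood, every option on that list does some version of the same thing: store embedding vectors, rank by similarity. A brute-force sketch with toy vectors and invented document names; real databases add approximate indexes (HNSW and similar) so this scales past a few thousand vectors.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy store: document id -> embedding vector (real ones are ~1000 dims).
STORE = {
    "doc-invoices": [0.9, 0.1, 0.0],
    "doc-contracts": [0.1, 0.9, 0.2],
}

def top_match(query_vec):
    # Linear scan: score every stored vector, return the best match.
    return max(STORE, key=lambda k: cosine(query_vec, STORE[k]))

print(top_match([0.8, 0.2, 0.0]))  # doc-invoices
```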
Twenty-eight times lower latency on something teams are already running is a hard number to ignore.
It's the unsexy infrastructure win that nobody talks about in framework comparison posts. And it connects to the broader data readiness problem — fifty-two percent of businesses cite data quality and availability as the biggest barrier to AI adoption, thirty-seven percent face data quality problems specifically for AI readiness. Agents are only as good as the data they can access. The technical work of making data clean, accessible, and well-documented is a prerequisite that often gets skipped in the rush to build agents.
Observability is the one that surprises people when they see the adoption numbers.
Eighty-nine percent of teams have implemented observability for their agents, according to LangChain's State of Agent Engineering report. That number is high because the failure mode without it is catastrophic — if you can't trace what an agent did, why it did it, and what it touched, you can't safely scale it. LangSmith is the native integration for LangGraph — it traces every decision point, manages prompt versions, tracks costs. Langfuse is the open-source framework-agnostic alternative with deep visibility into the prompt layer. Arize Phoenix gives you visual DAG representations of multi-agent workflows. The production workflow is: trace every run, turn real failures into evaluation datasets, run repeatable experiments with automated evaluators, promote only verified improvements to production.
Which leads into evaluation and testing, which is genuinely different from testing deterministic software.
You're evaluating statistical performance, safety, and reliability — not pass-fail unit tests. The key dimensions are tool selection accuracy — does the agent correctly choose which tool to use? Reasoning quality — does the chain of thought actually make sense? Task completion rate end-to-end. Safety compliance, staying within defined boundaries. And cost per task, which is increasingly a first-class metric. LLM API costs represent forty to sixty percent of operational expenses for agent systems. A crew of four agents can use three to five times more tokens than a single agent. Anthropic's prompt caching delivers ninety percent cost reduction on repeated context. Multi-model routing — using cheap models for simple tasks, expensive models for complex reasoning — cuts costs thirty to fifty percent.
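Multi-model routing can be sketched in a few lines. The model names, prices, and length heuristic here are all invented; production routers typically use a trained classifier or a scored rubric rather than string length, but the cost arithmetic works the same way.

```python
# Invented price table for illustration.
MODELS = {
    "cheap": {"cost_per_call": 0.001},
    "expensive": {"cost_per_call": 0.03},
}

def route(task: str) -> str:
    # Toy heuristic: long or explicitly multi-step prompts get the strong model.
    hard = len(task) > 200 or "step by step" in task
    return "expensive" if hard else "cheap"

def estimated_cost(tasks):
    return sum(MODELS[route(t)]["cost_per_call"] for t in tasks)

tasks = ["summarize this line", "plan the migration step by step"]
print(route(tasks[0]), route(tasks[1]))  # cheap expensive
print(round(estimated_cost(tasks), 3))
```

Routing only the second task to the expensive model cuts the bill versus sending both there, which is exactly the thirty-to-fifty-percent lever described above.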
Cost management as a core engineering skill. That's a real shift from how people thought about software costs even two years ago.
It's infrastructure cost management but with a different shape. You're making real-time routing decisions about which model to call based on task complexity. That's a new class of engineering problem that didn't exist before.
Security and governance is the most underinvested area, and arguably the most dangerous gap.
The numbers are stark. Sixty-eight percent of organizations lack identity security controls for AI agents, according to CyberArk. Forty percent of agentic AI projects will be cancelled by twenty twenty-seven, largely due to governance failures. Only five percent of enterprise AI solutions make it from pilot to production — MIT research across three hundred-plus implementations. The technical skills to build agents are becoming commoditized. The scarce skill is knowing how to govern them.
What does governance actually mean in practice here?
Prompt injection prevention — malicious inputs that hijack agent behavior. Role-based access control so agents only access what they're explicitly allowed to. Sandboxed execution to isolate agent tool calls from production systems. Audit logging where every agent action is traceable. Identity management — treating agents as identity principals with their own credentials, which is where Auth0 and Okta are building serious tooling. And regulatory compliance: EU AI Act, GDPR, HIPAA for healthcare agents. Human-in-the-loop patterns are becoming mandatory for high-stakes applications, not just good practice. LangGraph has the most mature HITL support in the framework ecosystem.
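Two of those controls fit in one small sketch: role-based access control on tool calls, plus an audit log that records every attempt whether or not it was allowed. Role names and tools are invented for the example.

```python
AUDIT_LOG = []

# Invented role -> permitted-tools table.
ROLES = {
    "support-agent": {"read_tickets"},
    "billing-agent": {"read_tickets", "issue_refund"},
}

def call_tool(agent_role: str, tool: str) -> bool:
    allowed = tool in ROLES.get(agent_role, set())
    # Audit logging: every attempt is recorded, allowed or denied.
    AUDIT_LOG.append({"role": agent_role, "tool": tool, "allowed": allowed})
    return allowed

assert call_tool("billing-agent", "issue_refund")
assert not call_tool("support-agent", "issue_refund")  # denied and logged
print(AUDIT_LOG)
```

Treating the agent's role as an identity principal, rather than trusting the prompt, is the whole point: a prompt-injected agent can ask for anything, but it can only execute what its role grants.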
There's something interesting happening with the no-code layer too. n8n, OpenAI Agent Builder, Gemini Opal — these tools are enabling non-engineers to build workflow agents that have real-world consequences.
And that's simultaneously the most exciting and the most alarming development. The democratization accelerates adoption, but it also means agents with real-world consequences are being deployed by people who have no framework for thinking about failure modes or governance. The technical barrier is dropping faster than the governance infrastructure is being built. That gap is where the forty percent cancellation rate lives.
Let's talk about the role transformation for engineers, because this is the career dimension that Daniel was pointing at.
Over thirteen percent of pull requests are already generated by bots, according to LinearB. Ninety percent of engineering teams use AI coding tools, up from sixty-one percent year-over-year. The question engineers are now paid to answer has shifted. It's no longer "what code should I write?" It's "what decisions will this system need to make? What happens when it fails? How do I keep humans meaningfully in control?" The mental model shift is from implementer to architect — from writing logic to designing systems that decide, act, and recover without supervision.
Which is a harder skill to develop than writing code, honestly.
It requires a different kind of thinking. Deterministic software either works or it doesn't. Agentic systems are probabilistic — they work most of the time, fail in non-obvious ways, and the failure modes compound across multi-step workflows. Designing for graceful degradation, building in recovery paths, knowing when to escalate to a human — these are judgment calls that require understanding the domain, not just the technology.
If someone is building a learning roadmap for this, Analytics Vidhya published a twenty-one week path that I think captures the right sequencing.
The structure is sensible. Foundations and taxonomy first, then no-code tools to build intuition, then Python and APIs including FastAPI and MCP servers, then LLMs and reasoning architectures, then RAG and vector databases, then the framework landscape comparison, then multi-agent systems and protocols, then AgentOps and observability, then security and governance, then projects — a data analyst agent, a research agent, a multi-agent SRE swarm. The sequencing matters because you need to understand what a single agent does well before you understand when you actually need multiple agents.
The market numbers put this in context. Ten point eight six billion dollars in agentic AI market size this year, up from seven point five five billion last year. Projected to ninety-three point two billion by twenty thirty-two at a forty-four point six percent compound annual growth rate. Forty percent of enterprise applications will embed task-specific AI agents by end of this year, up from less than five percent in twenty twenty-four.
Those adoption numbers are why the governance gap is so dangerous. The technology is scaling faster than the organizational infrastructure to manage it. McKinsey estimates forty-four percent of US work could be performed by AI agents today, with two point nine trillion dollars in annual economic value. Those numbers are driving urgency that is outrunning readiness.
The framework churn problem is worth naming explicitly as a practical risk. Seventy percent of regulated enterprises rebuilding their agent stack every three months is not a sign of a mature ecosystem.
It's a sign of an ecosystem in rapid flux. The practical response is defensive architecture: keep your business logic, your prompts, your evaluation harnesses, and your tool definitions portable and framework-agnostic. Treat the orchestration framework as a dependency you might replace, not as the foundation you build everything on. The teams that are weathering the churn are the ones that invested in portability early.
So if you had to distill the must-knows for a serious practitioner — what's the short list?
Python with async and Pydantic. TypeScript if you're building products. LangGraph for production stateful workflows. CrewAI for rapid prototyping and multi-agent business workflows. LlamaIndex if retrieval is your core problem. MCP and A2A as the protocol layer — you need to understand both. ReAct and the reasoning architectures that sit on top of it. Vector databases and how to choose between them. Observability tooling, because you cannot govern what you cannot trace. And security fundamentals — prompt injection, RBAC, sandboxed execution, audit logging. That's the stack.
And the mindset shift: start with a single agent, add complexity only when you hit a clear limitation, and invest as much in prompt engineering as in framework selection.
The eighty percent rule is real. The practitioners who are building things that actually ship are consistently the ones who resisted the urge to over-architect. The temptation to build elaborate multi-agent systems is high because they're intellectually interesting. The discipline to ask "does this actually need to be multi-agent?" is what separates production engineers from prototype engineers.
That's a good place to land. Thanks as always to our producer Hilbert Flumingtop for keeping this show running, and big thanks to Modal for providing the GPU credits that power the generation pipeline behind every episode. This has been My Weird Prompts. If you're enjoying the show, a quick review on your podcast app genuinely helps us reach new listeners. Until next time.
Take care.