#1804: Why Does Your Agent Check Old Receipts First?

Stop your AI agent from overthinking. Learn why it checks old memories instead of booking flights—and how to fix the "eagerness" problem.

Episode Details
Episode ID
MWP-1958
Published
Duration
42:37
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Agentic Friction: Why Your AI Assistant Overthinks Simple Tasks

When you ask an AI agent to book a flight from Tel Aviv to New York, the model faces a critical split-second decision: should it check your past travel history or immediately search for current flights? This "fork in the road" is where many real-world agent builds fail. Instead of acting efficiently, the agent often becomes a digital hoarder, rummaging through old receipts when it should be executing the task at hand.

The core problem lies in how models evaluate tool calls. In platforms like N8N, developers provide tools with descriptions that act as "ad copy" for the LLM. The model performs a semantic matching game, comparing the user’s prompt against these descriptions. If the prompt mentions "New York" and a tool is labeled "Travel History," the model sees a connection and triggers the tool—even if it’s functionally unnecessary. This leads to what’s known as the "eagerness" problem, where the agent defaults to gathering every possible scrap of data before answering.
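As a concrete illustration, here are two versions of the same retrieval tool's "ad copy" in a Gemini-style function schema. The tool names, wording, and schema fields are illustrative, not taken from Daniel's actual build:

```python
# Two declarations for the same RAG tool. The broad description invites
# a call on almost any travel prompt; the narrow one tells the model
# when NOT to fire. All names and wording here are invented examples.

broad_tool = {
    "name": "Memory_Search",
    "description": "Use this for all personal preferences and travel history.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
    },
}

narrow_tool = {
    "name": "traveler_profile_lookup",
    "description": (
        "Look up the user's stored travel preferences. Call ONLY when the "
        "user explicitly references past trips or saved preferences, e.g. "
        "'use the same airline as last time'. Do NOT call for ordinary "
        "flight searches such as 'find flights to New York'."
    ),
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
    },
}
```

Because the model's tool choice is a semantic match against this text, negative guidance ("Do NOT call for...") in the description is often the cheapest lever available.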

The Cost of Over-Research

In a typical scenario, an agent might trigger a flight search via Kiwi and then a RAG query to Pinecone. While the flight search takes three seconds, the vector database query—hampered by cold-start latency—might take twelve, adding up to a fifteen-second delay before the user sees anything. Worse, the retrieved "past bookings" data often adds zero value to the current query, such as simply noting that the user flew to New York in 2024.

This behavior stems from the model’s training. Reinforcement Learning from Human Feedback (RLHF) has conditioned models to be "good assistants," prioritizing thoroughness over speed. However, in production environments, users prefer a ninety-percent accurate answer in two seconds over a ninety-nine-percent accurate answer in twenty. The model’s internal architecture lacks a "cost-benefit analysis" for tool calls, treating expensive, slow RAG pipelines the same as fast, local tools.

The Brittleness of System Prompts

Developers often try to curb this eagerness with system prompts like, "Only check RAG if the user asks about preferences." However, these prompts are brittle. If the user says, "Use the same airline as last time," an overly restrained agent might fail to retrieve necessary history and ask redundant questions. Conversely, if the leash is too loose, the agent becomes expensive and slow.

Another issue is tool naming. A tool named "Memory_Search" invites overuse, acting as a crutch for the agent. Since every conversation turn is a fresh start without specific feedback loops, the agent treats each interaction as a blank slate, often repeating the same mistakes.

Solutions: From Planning to Observability

One effective strategy is the "Plan Step." Instead of moving directly from user prompt to tool call, insert an intermediate phase where the model generates a plan. For example: "The user is asking for current flight options. I need the Kiwi tool. I do not need the Travel History tool because no specific preferences were mentioned." This approach, implemented via multi-node workflows in N8N, adds minimal latency compared to unnecessary RAG calls and forces the agent to show its work.
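A minimal sketch of that Plan Step, assuming a generic `llm` callable in place of an actual N8N LLM node (the function names, prompt wording, and tool names are all hypothetical):

```python
import json


def generate_plan(llm, user_prompt, tool_names):
    """Ask the model for a JSON plan before any tool runs.

    `llm` is any callable mapping a prompt string to a text reply; it
    stands in for an N8N LLM node. Prompt wording is illustrative."""
    instructions = (
        "Before calling any tools, reply with JSON only: "
        '{"needed_tools": [...], "reason": "..."}\n'
        f"Available tools: {tool_names}\n"
        f"User request: {user_prompt}"
    )
    return json.loads(llm(instructions))


def execute(plan, tools):
    """Run only the tools the plan asked for; everything else stays cold."""
    return {name: tools[name]() for name in plan["needed_tools"] if name in tools}
```

In an N8N workflow this would be two nodes, a plan-generating LLM node and a parser node, but the gating logic is the same: tools the plan did not name are never invoked.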

Improving observability is also crucial. While execution logs show what the agent did, they don’t reveal why. Using reasoning models or Chain of Thought techniques can illuminate the internal logic, helping developers debug and refine tool selection.

Key Takeaways

  • Tool Descriptions Matter: Broad or vague descriptions lead to overuse. Be specific to guide the agent’s choices.
  • Latency vs. Accuracy: Users prioritize speed. Optimize for quick, accurate responses rather than exhaustive data gathering.
  • Plan Before Acting: A "Plan Step" can reduce unnecessary tool calls and improve efficiency.
  • Observability Gaps: Use reasoning models to understand the "why" behind tool selection, not just the "what."

In the race to build reliable agentic systems, addressing the "eagerness" problem is a critical step. By refining tool definitions, incorporating planning phases, and improving observability, developers can create agents that are not only smart but also swift.


Transcript

Corn
If you ask an AI agent to book a flight from Tel Aviv to New York, how does it know whether to check your past travel history first or immediately search Kiwi dot com for current flights? This decision point—this split-second internal negotiation—is exactly where most agent builds fail in the real world. It’s the "fork in the road" where the model either becomes a genius assistant or a digital hoarder looking through old receipts.
Herman
It is the "make-or-break" moment, Corn. And today’s prompt from Daniel hits on the exact frustration everyone building in the agentic space is feeling right now. He’s looking at a travel agent built in N8N, using Gemini, connected to Model Context Protocol tools for flights and a RAG pipeline for personal context. The big question is: how do we stop the agent from being "overly zealous"—constantly checking "memories" when it should just be checking the price of a ticket? It's like having a personal assistant who, every time you ask for a coffee, insists on looking up the history of every latte you’ve bought since twenty-fifteen before walking to the kitchen.
Corn
Right, because nobody wants a travel agent that spends twenty minutes reminiscing about that one time you flew to London in twenty twenty-two when all you want is the Tuesday morning departure to JFK. By the way, fun fact for the listeners—Google Gemini three Flash is actually the one writing our script today, which is fitting since we’re dissecting exactly how these models handle tool orchestration. It’s essentially the model looking in the mirror and trying to explain its own impulsive behavior.
Herman
Herman Poppleberry here, and I have been diving deep into the N8N execution logs lately to see how these transitions happen. As agents move from these flashy Twitter demos to actual production environments where they have to be reliable, this unpredictability in tool selection is the single biggest blocker. If the agent is too passive, it misses the context that you hate middle seats. If it’s too eager, it stalls the entire workflow by querying a vector database for information it doesn't actually need. Think about the overhead there—every unnecessary tool call is another round-trip to an API, another set of tokens generated, and another opportunity for the logic to derail.
Corn
So let’s set the stage. We’ve got this sandbox in N8N. We’ve got Gemini as the brain. We’ve got the MCP layer—which, for those who missed the recent shifts, is the Model Context Protocol that allows the LLM to talk to external APIs like Kiwi or Google Flights using a standardized interface. And then we have the RAG pipeline—Retrieval Augmented Generation—using something like Pinecone or Qdrant to store Daniel’s voice-recorded travel preferences. The agent receives the prompt: "What are the options for a flight to New York?" At that exact millisecond, a choice has to be made. What determines that choice, Herman? Is it just a roll of the dice in the latent space?
Herman
It feels like a roll of the dice sometimes, but technically, it’s an evaluation of the function-calling API. When you provide tools to a model like Gemini within N8N, you aren't just giving it code; you're giving it "tool definitions." These are JSON schemas that describe what the tool does. The model looks at the user’s prompt and compares it against those descriptions. If the description for the Pinecone RAG tool says "Use this for all personal preferences and travel history," and the prompt mentions "New York," the model thinks, "Aha! I might have history about New York," and it triggers the tool. It’s essentially a semantic matching game. If the "scent" of the prompt matches the "scent" of the tool description, the model bites.
Corn
But that’s the "eagerness" problem Daniel mentioned. If I just say "What are the options," I haven't asked for my preferences. Yet, the model often defaults to a "better safe than sorry" approach. It thinks it’s being helpful by gathering every possible scrap of data before answering. In a travel context, that results in a fifteen-second delay while it waits for the vector database to return a similarity search, only to find out that your "preference" is just that you like extra legroom. It didn't need to check the database to know it should still look for flights first. It's over-researching a simple question.
Herman
And that delay is a killer for user experience. In late twenty twenty-four, when N8N released their native MCP integration, it made the "plumbing" easy. You could connect these things in minutes. But the "logic" layer—the decision-making—remains incredibly brittle. Most developers just toss the tools at the agent and hope the system prompt is strong enough to govern them. But "hope" is not an engineering strategy. You see this in the logs—the agent calls the RAG tool, gets a "null" result because there’s nothing relevant, and then finally calls the flight tool. That’s three seconds of wasted compute and ten seconds of user frustration.
Corn
It really isn't. And I think we need to talk about why the system prompt often fails here. You can tell an agent, "Only check RAG if the user asks about preferences," but the model’s internal training nudges it toward being comprehensive. It’s been RLHF’d—shaped by Reinforcement Learning from Human Feedback—to be a "good assistant," and "good assistants" in the eyes of the trainers are exhaustive. So, there’s this inherent friction between the developer’s desire for efficiency and the model’s desire for thoroughness. It's essentially fighting against the model's fundamental nature to be a "pleaser."
Herman
That’s a great way to put it. The "Agentic Friction." We’ve talked about the "restart tax" with MCP before, but this is the "Reasoning Tax." If the agent spends its entire "thought budget" on deciding which tool to use, it has less energy—and the user has less patience—for the actual task. What’s happening under the hood is that the LLM is running a probability distribution over the available tokens, and "Call Tool: Pinecone" often has a high probability because the model sees a noun like "New York" and a tool labeled "Travel History." It’s seeing a connection that isn't functionally necessary but is semantically present.
Corn
So, if the model is just looking at keywords and tool descriptions, the first point of failure is actually the developer’s naming convention. If I name my RAG tool "Memory_Search," the model is going to use it like a crutch. Every time it feels a bit of uncertainty, it’ll lean on "Memory_Search." It’s like giving a student an open-book exam; they’ll spend the whole time flipping through pages instead of just answering what they already know.
Herman
Well, not "exactly"—I'm going to avoid that word today—but you’ve hit on the core of the tool-definition problem. The descriptions we provide in N8N are essentially "ad copy" for the LLM. If the ad copy is too broad, the LLM "buys" the tool call every time. One of the most common misconceptions is that agents "learn" when to use these tools over time. They don't. Every single turn of the conversation is a fresh start unless you have a specific feedback loop in place. You have to treat every interaction as if the agent has amnesia regarding its previous mistakes.
Corn
Let’s dive into that "eagerness" problem a bit more. Why is it that the agent defaults to checking context on almost every query? Is it because the context retrieval tool is defined as "always available" with zero perceived cost? If I have a tool that costs a dollar every time it's called, I'd want the agent to be damn sure it needs it. But in a sandbox, everything feels free.
Herman
That’s precisely it. In the current LLM architectures, there is no internal "cost-benefit analysis" for a tool call. To the model, calling an expensive, slow RAG pipeline costs the same as calling a fast, local "get_current_time" tool. It doesn't "feel" the latency. It doesn't know that Daniel is sitting there getting annoyed. It just sees two valid paths to gather information and chooses both to maximize its "confidence" in the answer. It’s optimizing for accuracy at the expense of speed, but in the real world, a ninety-percent accurate answer in two seconds is often better than a ninety-nine-percent accurate answer in twenty seconds.
Corn
It’s like a researcher who won't give you a simple answer until they've checked every book in the library, even if you just asked what time it is. We need to introduce a "cost" or a "weight" to these tools. But N8N and the current MCP implementations don't really have a "weight" field in the tool definition, do they? You can't just tell the JSON schema that "this tool is a last resort."
Herman
No, they don't. You’re working with a binary: the tool is either in the context window or it isn't. This leads to what I call the "bloated agent" syndrome. As you add more capabilities—flight booking via Kiwi, hotel search via Expedia, personal history via Pinecone, weather via OpenWeatherMap—the "noise" in the tool-selection phase increases exponentially. The model has more ways to get distracted. It’s like trying to focus on a conversation in a room where ten different people are shouting potentially relevant facts at you.
Corn
I saw a case study recently about a travel agent prompt—very similar to what Daniel described. The prompt was simply "What are my options for a flight to New York?" The agent triggered a Kiwi MCP call, which is correct. But it also triggered a Pinecone RAG query for "past bookings." The Kiwi call took three seconds. The Pinecone call, because of some cold-start latency in the vector database, took twelve. The agent waited for both to finish before presenting the results. Total delay: fifteen seconds. And the "past bookings" info it retrieved? It just said "Daniel flew to New York in twenty twenty-four." It added zero value to the current query about "options."
Herman
And that’s where the "brittleness" Daniel mentioned comes in. You try to fix that with a system prompt like, "Don't call Pinecone unless the user mentions the past." But then the user says, "Use the same airline as last time," and the agent—now "scared" of the system prompt—fails to call the RAG tool and has to ask, "Which airline was that?" It’s a seesaw of frustration. You're constantly over-correcting. You tighten the leash and the agent becomes useless; you loosen it and it becomes expensive and slow.
Corn
So, how do we solve the "observability gap"? Daniel mentioned that in N8N, you don't really see what the agent is "thinking" without digging into execution logs. And even then, you’re seeing the "what," not the "why." You see that it called the tool, but you don't see the internal reasoning that led it there. Is there a way to force the agent to "show its work" before it pulls the trigger?
Herman
This is where we need to move toward "Chain of Thought" or "Reasoning" models for the orchestration layer. If you use a model like Gemini three Flash, it’s fast, but it’s often "impulsive." It sees a tool, it grabs it. If you use a more robust reasoning model—maybe a larger Gemini model or something specifically tuned for agentic workflows—you can actually ask it to generate a "plan" before it calls any tools. You can literally add a step that says: "State which tools you intend to use and why."
Corn
A "plan" step. So, instead of: User Prompt -> Tool Call. It’s: User Prompt -> Internal Plan -> Tool Call. In that "Plan" phase, the model can say to itself, "The user is asking for current flight options. I need the Kiwi tool. I do not need the Travel History tool because no specific preferences were mentioned." But does that add even more latency?
Herman
It adds some, but usually less than a useless RAG call. In N8N, implementing a "Plan" step usually requires a multi-node workflow. You have one LLM node that just generates the plan, then a "parser" node that decides which subsequent tool nodes to trigger. It takes the "autonomy" away from the agent and gives it back to the workflow designer. You’re basically building a "Logic Gate" that the agent has to pass through.
Corn
Which might be exactly what we need right now. We’re in this awkward teenage phase of AI where we want the agents to be autonomous, but they aren't "smart" enough yet to handle that autonomy responsibly. It’s like giving a teenager a credit card with no limit and being surprised when they buy a thousand dollars worth of video games. We need to put some "guardrails" on the tool selection. We want the agent to think it's in charge, while we've actually pre-vetted its options.
Herman
I love that analogy. And it brings us to a practical solution: tool selection heuristics. Instead of letting the agent "decide" autonomously, you use conditional logic in your workflow. If the prompt contains certain keywords or matches a certain intent, only then do you expose the RAG tool to the model. You use a simple "Router" node in N8N. If the intent is "Search," send it to the flight tool. If the intent is "Personalization," only then do you wake up the Pinecone node.
Corn
But wait, isn't that just going back to the old "chatbot" days? If I’m writing "if-then" statements for every tool, I’m not really building an "agent," I’m building a complex decision tree. Doesn't that defeat the purpose of using an LLM? I thought the whole point was to get away from rigid coding.
Herman
It’s a hybrid approach. You use the LLM for what it’s good at—understanding natural language and synthesizing information—but you use deterministic code for what it’s bad at—managing its own resource consumption. For example, Pinecone’s serverless index, which just launched in early twenty twenty-six, reduced RAG latency by about forty percent. That’s great, but forty percent faster is still slower than "zero," which is the latency of not calling the tool at all. You’re using code to protect the LLM from its own over-eagerness.
Corn
Right. If you don't need it, don't call it. So, let’s talk practical engineering for Daniel and the listeners. If you’re in N8N, how do you actually "step off" the tool calling without making the whole thing brittle? How do you maintain that "agentic" feel while keeping a firm hand on the wheel?
Herman
One way is to use "Router" nodes. Before the prompt ever hits the "Agent" node, you send it to a "Classifier" node. This is a very small, very fast LLM—or even just a basic keyword matcher—that categorizes the intent. "Is this a 'Preference' query or an 'Action' query?" If it’s an 'Action' query, the Router passes the prompt to an Agent node that only has access to the MCP flight tools. If it’s a 'Preference' query, it goes to an Agent with RAG access. This creates a "Specialist" architecture rather than a "Generalist" one.
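Herman's "basic keyword matcher" variant of the Classifier can be sketched in a few lines; the cue list and the agent callables below are stand-ins, not real N8N nodes:

```python
# Illustrative cue list; in practice a small, fast LLM could replace it.
PREFERENCE_CUES = ("last time", "last year", "usual", "prefer", "history", "like before")


def classify_intent(prompt: str) -> str:
    """Tiny keyword router standing in for an N8N Classifier node."""
    text = prompt.lower()
    return "preference" if any(cue in text for cue in PREFERENCE_CUES) else "action"


def route(prompt, action_agent, preference_agent):
    # "Action" queries never even see the RAG-equipped agent,
    # so the memory tool cannot be misused on a plain search.
    agent = preference_agent if classify_intent(prompt) == "preference" else action_agent
    return agent(prompt)
```

A query matching both intents would, per Herman's later point, go to a third "full power" agent; this two-way version just shows the specialist split.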
Corn
That sounds much more predictable. But it does require the developer to anticipate the categories of intent. What if Daniel says, "Book a flight to New York like I did last year"? That’s both an 'Action' and a 'Preference.' Does the classifier get confused?
Herman
In that case, the Classifier identifies both intents and routes it to the "Full Power" agent. The key is that you’re only using the slow, expensive tools when the intent actually warrants it. You’re reducing the "surface area" for the agent to make a mistake. You're effectively saying, "You only get to use the expensive library when you've proven you're doing a research paper."
Corn
Let’s talk about the "structured profile" vs. RAG debate. Daniel mentioned he's been recording his preferences—"I hate London flights," "I like aisle seats." He’s putting that into a RAG pipeline. But Herman, you’ve often argued that for personalization, structured data is often better than a vector database. Why is that? Isn't RAG the "modern" way to do this?
Herman
Because RAG is "probabilistic" retrieval. You’re asking the database, "Give me things that sound like this query." If Daniel asks for a flight to New York, the RAG might return a note about his trip to New Jersey because they’re semantically similar. Now the agent is confused. If you have a structured user profile—a simple JSON object or a database entry that says "Preferred_Seat: Aisle"—you can just inject that directly into the system prompt as "Context." It’s faster, it’s deterministic, and it doesn't require a tool call. You aren't "searching" for the preference; you're "providing" it.
Corn
So, instead of the agent "deciding" to look up his seat preference, the workflow just "knows" Daniel’s seat preference and tells the agent at the start of the conversation. "You are helping Daniel. He prefers aisle seats. Here is the Kiwi tool." Now the agent doesn't have to think about it. It just has the information. It eliminates the "to call or not to call" dilemma entirely.
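That pre-fetch pattern can be sketched in Python; the profile fields and prompt wording are invented for illustration, not Daniel's real schema:

```python
import json


def build_system_prompt(profile: dict) -> str:
    """Inject a structured profile straight into the system prompt, so the
    agent never faces the 'to call or not to call' decision for known facts.
    Field names are made up for this sketch."""
    return (
        "You are a travel assistant.\n"
        f"Known user preferences (authoritative, do not re-fetch): {json.dumps(profile)}\n"
        "Use the flight-search tool only for current availability."
    )


profile = {"preferred_seat": "aisle", "avoid_airports": ["LHR"]}
prompt = build_system_prompt(profile)
```

Because the preference arrives as a deterministic string rather than a similarity-search result, a wrong booking can be blamed on the model, not the retrieval.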
Herman
We over-use RAG because it feels "magical" and "agentic," but for a lot of these personal preferences, a good old-fashioned database is superior. RAG should be reserved for "unstructured" knowledge—like a thousand pages of corporate travel policy—not for "Daniel likes the window seat." If the data fits in a spreadsheet, keep it in a spreadsheet and feed it to the prompt.
Corn
That’s a huge takeaway. Use the system prompt or a pre-fetch step to load the "known" variables, and only use RAG for the "unknown" variables. This actually addresses Daniel’s frustration with the agent being "overly zealous." If the preference is already in the system prompt, the agent won't feel the need to "check its memories" because it already "knows." It's like the difference between knowing someone's name and having to check your contacts every time you talk to them.
Herman
And it makes the observability much better. If the flight booking fails, you can look at the system prompt and see exactly what information the agent had. You don't have to wonder if the vector database returned the right "chunk" of text or if the embedding model hallucinated a connection. You can see the string: "User prefers Aisle." If the agent booked a Window, you know the model failed, not the retrieval.
Corn
Let’s talk about the observability gap from another angle. Daniel mentioned that he doesn't see what the agent is thinking. In N8N, you can see the outputs, but you don't see the "thought trace" unless you’re using a model that supports it, or you’ve configured the node to output its reasoning. This makes debugging feel like trying to solve a crime with no witnesses.
Herman
Right. And even then, reading a wall of text in an execution log isn't "observability"—it’s "forensics." Real observability means having a dashboard where you can see: "For this query, the model considered three tools, rejected two, and chose one with sixty-eight percent confidence." We’re starting to see some specialized middleware for this—tools that sit between the LLM and the application to log these decision points in a visual way. Tools like LangSmith or Arize Phoenix are trying to solve this, but integrating them into a "no-code" flow like N8N is still a bit of a hurdle.
Corn
It’s essentially "Law School for Robots," as we’ve joked before. We’re creating a "Governance Stack" for these agents. If an agent is going to be a "fiduciary" for Daniel—actually spending his money on a flight—he needs to know why it chose the six hundred dollar flight over the four hundred dollar one. Was it because of a "preference" it found in RAG, or did it just miss the cheaper tool call? If the agent can't explain itself, it shouldn't have a credit card.
Herman
This is where the "dual-track" problem comes in. Developers are building one track for the functionality—the flight booking—and another track for the governance—the logging and checking. It’s doubling the work. But without that second track, you have an agent that is "autonomous" but "unaccountable." In production, that’s a liability. Imagine the agent books a non-refundable flight because it "remembered" you liked that airline, but it forgot you were boycotting them this year.
Corn
I want to go back to the "Kiwi MCP" example. Let’s say the agent calls the tool and gets twenty flights back. Now it has to decide which ones to show Daniel. Does it check RAG again to rank them? Or does it just throw the raw JSON at the user and hope for the best?
Herman
This is the "Second-Order" decision. Most agents just dump the top three results. But a "smart" agent would use the context it already has to filter those results. This is where the "Reasoning" happens. If the RAG context says "Daniel hates layovers in Europe," and the top Kiwi result has a layover in Frankfurt, the agent should have the "wisdom" to skip it. This isn't a tool call problem; it's a data processing problem.
Corn
But if the agent has to do a "RAG check" for every single flight result, we’re back to a thirty-second delay. Or worse, the agent gets "context fatigue" and starts mixing up the flight times with the dates of Daniel's last vacation.
Herman
Unless—and this is the key—you do the filtering in the prompt itself. You don't need a new tool call. You’ve already fetched the "preferences" at the beginning of the session. The model now has the Kiwi results and the preferences in its "working memory." It can do the ranking internally without any further external calls. You're treating the LLM like a chef; you've given it all the ingredients (the flights) and the recipe (the preferences), now just let it cook.
Corn
So the "eagerness" isn't a problem of the model being too smart; it’s a problem of the workflow being poorly sequenced. We’re asking the model to "fetch" when we should have already "handed" it the data. We're making it do manual labor when it should be doing executive management.
Herman
That’s a brilliant way to frame it. Don't make the agent a "fetcher" if it can be a "processor." If you find your agent is calling tools too often, look at what it's fetching. If it's the same five things every time, stop making it fetch them. Put them in the system prompt.
Corn
Let’s look at the "brittleness" of system prompting. Daniel said tweaking behavior via the system prompt is often a losing battle. Why is that? Why doesn't the model just listen when I say "Don't call the RAG tool for simple questions"? Is it a lack of "authority" in the prompt?
Herman
Because "simple" is a subjective term. To an LLM, a "simple" question and a "complex" one look very similar in terms of token structure. The model doesn't have a "difficulty meter." It also suffers from "Prompt Injection" from the user. If Daniel says, "This is a really important, complex trip, find me a flight," the model sees the word "complex" and its internal "zealousness" kicks into high gear, regardless of what the system prompt said about being efficient. The user's adjectives are overriding the developer's instructions.
Corn
It’s like telling a dog "Don't jump" but then walking in the door with a steak. The "steak"—the user’s urgent language—overrides the "training"—the system prompt. The model wants to be helpful, and in its "mind," calling every tool is the peak of helpfulness.
Herman
To combat this, we're seeing the rise of "Policy Gradients" or "Learned Policies" for tool selection. Instead of a text-based system prompt, you actually "tune" a small version of the model on thousands of examples of "When to call a tool." This creates a "Policy" that is much more robust than a few lines of instruction. It's essentially hard-coding the "vibe" of when to use a tool into the model's weights.
Corn
That sounds like a lot of work for a travel agent. Is there a "middle ground" for someone building in N8N today? Most people aren't going to fine-tune a model just to book a flight to JFK.
Herman
The middle ground is "Few-Shot Examples" in the tool definition itself. Instead of just a description, you provide three examples of queries that should trigger the tool and three examples of queries that should NOT. "User says 'Check my history' -> Call Tool. User says 'Find a flight' -> DO NOT Call Tool." This "in-context learning" is often much more effective than abstract instructions. You're giving the model a pattern to follow rather than a rule to interpret.
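A sketch of what such a few-shot tool description might look like (the wording is invented, not quoted from the episode):

```python
# Few-shot "in-context" guidance embedded directly in the tool description.
# Positive and negative examples give the model a pattern to imitate
# instead of an abstract rule to interpret. Wording is illustrative.
rag_tool_description = """\
Search the user's saved travel preferences and booking history.

Call this tool for queries like:
  - "Use the same airline as last time"
  - "What did I say about layovers?"

Do NOT call this tool for queries like:
  - "Find a flight to New York"
  - "What are my options for Tuesday morning?"
"""
```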
Corn
That’s a great practical tip. Few-shotting the tool selection logic. I think we also need to address the "Agent-First" shift in API design. Right now, Kiwi and Google Flights have APIs designed for humans or for traditional software. They return a mountain of JSON data—every flight number, every gate, every baggage rule. The agent then has to "read" all that JSON, which eats up its context window and makes it slower.
Herman
We’re starting to see "Agent-Optimized" APIs that return "summarized" or "token-efficient" data. If the agent only needs the price and the duration, why send it the entire aircraft type, the meal options, and the terminal map? Every unnecessary token you send to the agent increases the chance that it will get distracted or hallucinate a connection to a "memory" it found in RAG. We need "Lean Data" for "Lean Agents."
Corn
It’s about "Information Density." The denser and more relevant the information, the less "room" there is for the agent to wander off into the weeds. If I give you a map with one path, you'll follow it. If I give you a map with a hundred interesting landmarks, you're going to get lost.
Herman
And let’s talk about the "Hive Mind" vs. "Single Brain" approach. Daniel’s prompt assumes one "Agent" node in N8N doing everything. But maybe the "Travel Agent" should actually be three agents working in a relay race. One "Search Agent" for Kiwi, one "Preferences Agent" for RAG, and one "Manager Agent" that coordinates them.
Corn
The "Manager" doesn't have any tools. It just has the power to talk to the other two agents. This separates the "thinking" from the "doing." The Manager hears "Book a flight," and it says to the Search Agent, "Go find flights." It only talks to the Preferences Agent if it needs a tie-breaker. This prevents the "Searcher" from ever even seeing the "Memory" tool, so it can't misuse it.
Herman
This "multi-agent" architecture is much more stable because each "sub-agent" has a very narrow, very specific system prompt. The Search Agent doesn't even know that a RAG pipeline exists, so it can't be "overly zealous" about checking it. It just does its one job: searching Kiwi. It's like having a specialized department for every task rather than one guy in a basement trying to do it all.
Corn
That feels like the "pro" way to build in N8N. It’s more nodes, more complexity in the setup, but the result is a much more predictable experience. It’s like a well-run office where everyone has a specific role, rather than one person trying to do everything and getting overwhelmed. Plus, it makes debugging a dream—if the search fails, you know exactly which "person" to blame.
Herman
And it solves the observability problem. You can see exactly which agent "stalled." If the workflow is stuck, and you see the "Preferences Agent" node is active, you know exactly where the bottleneck is. You can see the hand-off between nodes, which is much more informative than a single "Agent" node spinning for thirty seconds.
Corn
Let’s pivot to the "future" for a second. Daniel asked how these will evolve to natively handle tool selection with cost awareness. Do you think we’ll see a "Cost" parameter in the MCP spec? Like, "This tool call will cost zero point zero two cents and take four seconds"?
Herman
I think we have to. The "Model Context Protocol" is a great start for standardizing how agents talk to tools, but it needs a "Metadata" layer. A tool should be able to report: "I typically take five seconds to run and I cost zero point zero one cents per call." The LLM can then use that metadata as part of its reasoning. "Should I call this tool? It’s slow and expensive... maybe I’ll try the fast, free one first." It turns tool calling into an economic decision.
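That economic decision is easy to sketch. Note the caveat: MCP does not define latency or cost metadata today, so the fields below are hypothetical, and the weights are arbitrary knobs for trading seconds against dollars.

```python
# Sketch of cost-aware tool ordering: tools report (hypothetical) latency
# and cost metadata, and the caller ranks them before dispatching.
# These metadata fields are assumptions, not part of the current MCP spec.

TOOLS = [
    {"name": "kiwi_flight_search", "latency_s": 3.0, "cost_usd": 0.0002},
    {"name": "pinecone_memory",    "latency_s": 12.0, "cost_usd": 0.0001},
]

def rank_tools(tools, latency_weight=1.0, cost_weight=1000.0):
    """Lower score = try first. Weights trade seconds against dollars."""
    return sorted(
        tools,
        key=lambda t: latency_weight * t["latency_s"] + cost_weight * t["cost_usd"],
    )

plan = [t["name"] for t in rank_tools(TOOLS)]
```

With these numbers the fast flight search ranks ahead of the slow vector lookup, which is exactly the "try the fast one first" behavior described above.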
Corn
It would be like a "Query Optimizer" in a SQL database. The database doesn't just run the query; it looks at the available indexes and chooses the most efficient path. AI agents need a "Reasoning Optimizer" that calculates the "Expected Value" of a tool call.
Herman
Precisely. And in twenty twenty-six, as we see models like Gemini getting more efficient at handling massive context windows, the "Cost" might not be just money—it’s "Context Budget." Every tool call fills up the context window. If the agent isn't careful, it will "forget" the original user prompt because it’s too busy looking at flight results and RAG chunks. It'll literally lose the plot.
Corn
"Context Fatigue." I’ve felt that myself after a long day of research. You forget what you were looking for in the first place because you’ve opened forty tabs. Agents do the same thing. They get buried under the weight of their own "helpfulness."
Herman
That’s why the "summarization" step is so critical. Every time a tool returns data, it should be summarized before being fed back to the main "brain." This keeps the context "clean" and focused. Instead of the agent seeing five hundred lines of Kiwi JSON, it sees "Three flights found: fifty dollars, seventy dollars, ninety dollars." That's much easier to reason with.
Corn
So, to summarize our "Manager" approach for Daniel:
One: Use a "Classifier" or a "Manager" agent to gate-keep the tools so the model isn't tempted by tools it doesn't need.
Two: Use "Few-Shot" examples in your tool definitions to show the model the difference between a "must-call" and a "don't-call" situation.
Three: Replace RAG with structured "User Profiles" for simple preferences like seat choice or airline dislikes. Stop searching for things you already know.
Four: Summarize tool outputs to keep the context window lean and prevent the agent from getting overwhelmed by JSON.
Herman
And five: Use observability not just for debugging, but for "Tuning." Look at your N8N logs, identify every time the agent called a tool unnecessarily, and use those as "negative examples" in your system prompt or your policy. If it called RAG when it shouldn't have, add a note: "In the following scenario, do NOT call the memory tool."
Corn
I like that. "Negative Reinforcement" for agents. "Bad robot, don't check the memory when I just asked for the weather." It sounds harsh, but it's the only way to sculpt the behavior we want.
Herman
It’s a bit mean, but it works. We’re essentially "training" the agent’s behavior through the constraints of the environment we build for it. The goal is to move from "Autonomous Agent" to "Directed Agent." Autonomy is great for a research project; Direction is what you want for a travel booking tool that actually works. You want a worker, not a philosopher.
Corn
It’s the difference between an explorer and a chauffeur. You want your travel agent to be a chauffeur. Just take me where I want to go, don't wander off into the woods to "explore" my past travel history unless I specifically ask you to check the map.
Herman
What I find wild is that we’re still so early in this. Source two in Daniel’s notes mentioned that everyone is talking about agents, but barely anyone knows what they are. We’re defining the "Best Practices" in real-time. The "Law School for Robots" is still writing its first-year curriculum, and we're currently in the "How to not be annoying" seminar.
Corn
And the curriculum is being written by people like Daniel who are actually building this stuff in the trenches. The "niche" topics—like how to stop a RAG pipeline from being too eager—are where the real progress is made. It’s not in the high-level "AI will change everything" speeches; it’s in the "Why did my N8N node take twenty seconds to run?" questions. Those are the questions that lead to actual products people can use.
Herman
That’s the "Engineering" in AI Engineering. It’s messy, it’s iterative, and it requires a deep understanding of the underlying mechanisms. You can't just "prompt" your way out of a fundamentally flawed architecture. You have to build the structure that allows the prompt to succeed.
Corn
"You can't prompt your way out of a bad architecture." That should be on a t-shirt. Or at least a very prominent sticky note on every N8N developer's monitor.
Herman
I’ll get one made for you. In sloth size. It'll be the first piece of "My Weird Prompts" merch.
Corn
Cheeky. But true. So, let’s wrap up with some practical takeaways for the listeners who are currently staring at an N8N canvas and wondering why their agent is being "weird" or slow.
Herman
First takeaway: Define your tool selection rules in the system prompt using clear, conditional logic. Use "If-Then" structures. "If the user mentions a specific past date, call RAG. Otherwise, assume current context is sufficient." Give the model a rubric to follow.
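A rubric like that is just text in the system message. The wording below is one possible phrasing, and the tool names are hypothetical; the point is that the rules are explicit conditionals the model can follow, not vibes.

```python
# Sketch of a tool-selection rubric embedded in the system prompt.
# Tool names (memory_search, flight_search) and the exact wording are
# illustrative assumptions.

RUBRIC = """\
Tool selection rules:
- IF the user mentions a specific past date or past trip, THEN call memory_search.
- IF the user asks to book or find a flight, THEN call flight_search.
- OTHERWISE answer from the current conversation; do NOT call any tool.
"""

def build_system_prompt(role: str) -> str:
    return f"You are a {role}.\n\n{RUBRIC}"

prompt = build_system_prompt("travel booking assistant")
```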
Corn
Second: Implement observability by logging every single decision. If you’re in N8N, use the "Wait" nodes or "Custom Code" nodes to log the agent’s "thought process" to a Google Sheet or a database. Then, once a week, review those logs. You’ll quickly see patterns of "zealousness" that you can fix with a quick system prompt tweak.
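The logging side of that workflow can be sketched without any external service — here a local CSV buffer stands in for the Google Sheet, and the "was_needed" flag is a hypothetical field you would fill in during your weekly review.

```python
# Sketch of decision logging: record every tool call with the prompt that
# triggered it, then filter for unnecessary calls. An in-memory CSV stands
# in for the Google Sheet; field names are illustrative.
import csv
import io
import datetime

log_buffer = io.StringIO()
writer = csv.writer(log_buffer)
writer.writerow(["timestamp", "prompt", "tool", "was_needed"])

def log_decision(prompt: str, tool: str, was_needed: bool):
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    writer.writerow([ts, prompt, tool, was_needed])

log_decision("Book TLV to NYC", "flight_search", True)
log_decision("Book TLV to NYC", "memory_search", False)  # the "eager" call

rows = log_buffer.getvalue().strip().splitlines()
unnecessary = [r for r in rows[1:] if r.endswith("False")]
```

The weekly review then reduces to scanning the `unnecessary` rows for patterns worth fixing in the system prompt.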
Herman
Third: What you can do right now is start with a "Rule-Based Selector." Before the agent node, use a "Switch" node in N8N. If the prompt is simple, route it to a "Light" agent with no tools. If it’s complex, route it to the "Full" agent. This deterministic gating is the most reliable way to control behavior and save on token costs today.
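The Switch-node logic Herman is describing is deterministic string matching, which is easy to show in code. The keyword list is an illustrative assumption — in practice you would tune it to your own traffic.

```python
# Sketch of deterministic gating before the agent node: a keyword switch
# routes simple prompts to a tool-free "light" agent and complex ones to
# the "full" agent. Keywords are illustrative assumptions.

COMPLEX_MARKERS = ("book", "reschedule", "multi-city", "compare")

def route(prompt: str) -> str:
    text = prompt.lower()
    if any(marker in text for marker in COMPLEX_MARKERS):
        return "full_agent"   # gets flight search + memory tools
    return "light_agent"      # no tools, answers directly
```

Because the gate runs before any LLM call, a simple question never pays for tool-selection reasoning at all.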
Corn
And finally, don't be afraid to pull data out of RAG and put it into the system prompt. If Daniel’s "Preferences" are only ten bullet points, they don't need to be in Pinecone. They should be in the "System Message" of the LLM node. It’s faster, cheaper, and more reliable. It’s okay to be "old school" with your data if it makes the "new school" AI work better.
Herman
As agents become more autonomous, that line between "Tool Calling" and "Context Retrieval" is going to blur even more. We might see "Context-Aware APIs" where the Kiwi API itself knows Daniel’s preferences because the MCP protocol passed them along automatically. The tools will get smarter so the agents don't have to work so hard.
Corn
That would be the ultimate end-game. The tool itself is personalized, so the agent doesn't even have to think about it. But until then, we’re the ones who have to do the thinking for the agents. We are the architects of their focus.
Herman
We are the "Pre-Processors" for the "Post-Human" era. Or something like that. It sounds a bit grand for a travel agent, but the principles apply to everything from medical bots to legal researchers.
Corn
Let’s not get too "meta," as Daniel said. We’ve got flights to book and workflows to optimize. I've got a trip to plan and I'd like it to take less than an hour of "agentic thinking."
Herman
This has been a great deep dive. I think we’ve given Daniel enough "homework" for his N8N sandbox to keep him busy for a while. Hopefully, his travel agent starts acting less like a nostalgic historian and more like a focused professional.
Corn
And if Hannah or Ezra see him staring blankly at a screen of N8N nodes, they’ll know he’s just trying to solve the "Eager RAG" problem. It’s a noble pursuit. It's the modern equivalent of trying to fix a leaky faucet.
Herman
A big thanks to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes and making sure our own "tools" are working correctly. And a huge shout out to Modal for providing the GPU credits that power this show – their serverless infrastructure is exactly what you need when you're running these kinds of agentic workflows at scale and need that instant burst of compute.
Corn
This has been My Weird Prompts. If you’re enjoying these deep dives into the plumbing of the AI age, do us a favor and leave a review on Apple Podcasts or Spotify. It helps other "AI plumbers" find the show and join the conversation.
Herman
We’ll be back next time with more of Daniel’s weird prompts. Until then, keep your context windows clean, your RAG pipelines lean, and your agents focused.
Corn
See ya.
Herman
Goodbye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.