#2461: How Claude Code's Conversation Compaction Actually Works

The three-tier system, what survives, what dies, and why you shouldn't rely on auto-compact.

Episode Details
Episode ID
MWP-2619
Published
Duration
25:59
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

How Claude Code's Conversation Compaction Actually Works

Claude Code's conversation compaction is often misunderstood as a single feature, but it's actually a three-tier system designed to save tokens without sacrificing too much context. Understanding how it works under the hood — and its critical limitations — is essential for anyone using AI coding tools seriously.

The Three-Tier Architecture

Compaction isn't one mechanism. It's three, with cheaper options tried first:

  1. Tool result trimming — Replaces old tool outputs with placeholder text saying "old tool result content cleared."
  2. Cache-friendly prefix preservation — Uses structured information already in session memory.
  3. LLM-generated summary — The full summarization call that requires a separate model inference.

Most auto-compactions never reach layer three. They're handled by Session Memory Compact, which uses pre-existing structured data. No extra model call needed. The system only spins up a separate inference when cheaper options aren't sufficient.
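To make the tiering concrete, here's a minimal sketch of the fallback order. The helper functions are hypothetical stand-ins, not Claude Code's actual internals:

```python
# Minimal sketch of the three-tier fallback; helpers are hypothetical
# stand-ins, not Claude Code's real implementation.
PLACEHOLDER = "old tool result content cleared"

def session_memory_compact(messages):
    """Tier 2 stand-in: build a summary from pre-existing structured
    session state; return None if that state isn't sufficient."""
    return None

def llm_summarize(messages):
    """Tier 3 stand-in: the separate summarization inference."""
    return [{"role": "user", "content": "<summary of prior conversation>"}]

def token_count(messages):
    """Crude estimate: roughly four characters per token."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def compact(messages, budget):
    # Tier 1: replace old tool outputs with placeholders.
    for msg in messages[:-10]:  # keep the most recent turns intact
        if msg.get("type") == "tool_result":
            msg["content"] = PLACEHOLDER
    if token_count(messages) <= budget:
        return messages
    # Tier 2: structured session memory, no extra model call.
    summary = session_memory_compact(messages)
    if summary is not None:
        return summary
    # Tier 3: full LLM-generated summary.
    return llm_summarize(messages)
```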

Trigger Conditions

Manual compaction fires immediately via /compact (or /co). The entire conversation history gets sent to a summarization call, a summary comes back, and a compact boundary is inserted — everything before that boundary is retained in memory but no longer included in future prompts.

Auto-compaction kicks in at roughly 95% context capacity (the effective window minus about 13,000 tokens). There's also a passive fallback: if the API returns a prompt-too-long error, the system initiates reactive compression and retries. However, if auto-compaction fails three times in a row, it pauses to prevent an infinite loop. A known bug can cause the internal compaction process to hang indefinitely and burn through quota — the only fix is manual interrupt.
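The trigger arithmetic is simple enough to sketch; the 13,000-token reserve is the figure quoted above, and exact values may vary by model:

```python
# Auto-compact fires when usage crosses the effective window minus a
# ~13,000-token reserve, per the description above.
RESERVE_TOKENS = 13_000

def should_auto_compact(used_tokens: int, context_window: int) -> bool:
    return used_tokens >= context_window - RESERVE_TOKENS

# For a 200k window this fires at 187,000 tokens, i.e. roughly 94-95% capacity.
assert should_auto_compact(187_000, 200_000)
assert not should_auto_compact(150_000, 200_000)
```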

The Separate Model Call

When compaction requires an LLM summary, it's an additional sampling step that counts against rate limits and bills. The API returns usage broken into compaction iterations and message iterations separately.

In server-side compaction through the Messages API, the same model specified for your request handles summarization — there's no option to use a cheaper model. The Claude Agent Python SDK, however, does allow specifying a different model through its compaction control parameter. The trade-off is clear: a cheaper model saves money but risks summary quality degradation.
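A hedged sketch of that SDK-side knob — the class and field names below are illustrative, not the SDK's confirmed API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape: the Claude Agent SDK's real compaction-control
# parameter is likely named and structured differently.
@dataclass
class CompactionControl:
    enabled: bool = True
    model: Optional[str] = None  # None = reuse the main request's model

# Point summarization at a cheaper model: lower cost, riskier fidelity.
control = CompactionControl(model="claude-haiku")
```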

The Nine-Section Structured Prompt

Claude Code's summarization prompt is remarkably specific — it demands a nine-section structured summary capturing:

  • User intent as a direct quote
  • Core technical concepts
  • Files and code of interest
  • Errors encountered and how they were fixed
  • The problem-solving logic chain
  • A summary of all user messages
  • TODO items
  • What's currently being worked on
  • Suggested next steps

The demand for direct quotes rather than paraphrasing is a deliberate design choice to prevent context drift. Subtle meaning shifts accumulate over multiple compactions when models paraphrase. Direct quotes keep the summary anchored to original language — like taking verbatim meeting notes versus writing from memory hours later.
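As a rough illustration, a prompt enforcing that structure might look like the following sketch (the real Claude Code prompt is longer and worded differently):

```python
# Hypothetical skeleton mirroring the nine sections listed above.
NINE_SECTION_PROMPT = """\
Summarize the conversation so far using exactly these sections:
1. User intent (quote the user's own words verbatim)
2. Core technical concepts
3. Files and code of interest
4. Errors encountered and how they were fixed
5. The problem-solving logic chain
6. Summary of all user messages
7. TODO items
8. What is currently being worked on
9. Suggested next steps
Use direct quotes for key phrases; do not paraphrase them."""
```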

What Survives and What Dies

Preserved: CLAUDE.md (loaded as part of the system prompt, untouched by compaction), system prompts, tool definitions, MCP instructions, working directory state (all re-declared after compaction), current task and immediate context, recently modified file names, recent errors and solutions, general project architecture. Tool call structures survive — the fact that a search happened — but actual results get replaced with placeholders.

Lost: Instructions from session start ("don't touch this file," "use this format"), intermediate decisions (why you chose approach A over B), specific code snippets discussed fifty messages ago, subtle style rules (no emoji, no Co-Authored-By in commits).

The asymmetry is critical: compaction reliably preserves what to do next but systematically drops why we did what we did. The agent can continue executing correctly but loses the ability to explain its own reasoning or adapt when changing requirements invalidate earlier assumptions.

The Reconstruction Phase

After compaction, Claude Code does three things: injects a lead-in message saying "this session continues from a previous conversation, here's a summary," automatically re-reads recently edited files (up to five files, 50,000 token budget, 5,000 per file), and re-declares all tool and skill definitions. If you see Claude Code suddenly re-reading files after a compaction event, that's the reconstruction phase doing its job — the system knows verbatim file contents were among the first casualties.
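The re-read budget is concrete enough to sketch; `estimate_tokens` is a hypothetical helper:

```python
# Post-compaction file re-reads: at most five files, 50,000 tokens total,
# 5,000 tokens per file, per the limits described above.
MAX_FILES, TOTAL_BUDGET, PER_FILE_BUDGET = 5, 50_000, 5_000

def pick_rereads(recent_files, estimate_tokens):
    """recent_files: most recently edited first; estimate_tokens(path) -> int."""
    chosen, spent = [], 0
    for path in recent_files[:MAX_FILES]:
        cost = min(estimate_tokens(path), PER_FILE_BUDGET)  # truncate large files
        if spent + cost > TOTAL_BUDGET:
            break
        chosen.append(path)
        spent += cost
    return chosen
```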

Token Savings and Power User Strategies

In one benchmark from the official Anthropic cookbook, processing five support tickets with 35 tool calls dropped from 208,838 tokens to 86,446 — a 58.6% reduction with just two compaction events. Token count ties directly to both cost and latency.
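The arithmetic checks out:

```python
# (208,838 - 86,446) / 208,838 ≈ 58.6%
before, after = 208_838, 86_446
print(f"{(before - after) / before:.1%}")  # -> 58.6%
```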

The single most important piece of power user advice: treat the conversation as volatile working memory and CLAUDE.md as persistent storage. Everything the agent must always remember — formatting rules, files never to modify, commit message conventions — should live in CLAUDE.md, which loads as part of the system prompt outside the conversation history entirely. Compaction doesn't know it exists.
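A minimal illustration of the pattern — the file contents are examples drawn from the rules mentioned in this episode, not a recommended template:

```python
from pathlib import Path

# Rules written to CLAUDE.md load with the system prompt every session
# and are never touched by compaction.
Path("CLAUDE.md").write_text("""\
# Project rules
- Never modify files under migrations/.
- Commit messages: imperative mood, no Co-Authored-By lines.
- No emoji in code, comments, or output.
""")
```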

Power users also exploit the compactPrompt setting to override the default compaction prompt with custom instructions. But this carries risk: a custom prompt completely replaces the default nine-section engineering, and a bad prompt can produce worse results.
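A hedged sketch of setting it — the config location and exact key shape are assumptions here, so check your own setup:

```python
import json
from pathlib import Path

# Hypothetical location; the episode names a "compactPrompt" setting but
# your config file may differ. Note: a custom prompt fully REPLACES the
# default nine-section prompt rather than supplementing it.
settings_path = Path(".claude/settings.json")
settings_path.parent.mkdir(exist_ok=True)
settings_path.write_text(json.dumps({
    "compactPrompt": (
        "Preserve all rules from CLAUDE.md verbatim. Keep file paths, "
        "error messages, and architectural decisions. Summarize tool "
        "outputs but keep their conclusions."
    )
}, indent=2))
```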

The Philosophical Layer

An AI writing a summary of its own conversation so it can keep talking to itself is a form of synthetic episodic memory. The nine-section structured prompt is the model constructing its own memory. But is the post-compaction agent the same agent, or a new instance reading a briefing about what a previous instance did? The verbatim history is gone from the prompt. What remains is a summary written by the model about itself. Codex CLI calls its compaction output a "handoff memo," making the discontinuity explicit. Claude Code's approach implies continuity — but the identity question remains genuinely open.


#2461: How Claude Code's Conversation Compaction Actually Works

Corn
Daniel sent us this one — he wants us to walk through how Claude Code's conversation compaction actually works under the hood. The trigger conditions, the separate model call it makes, the structured prompt, the in-memory swap, what survives versus what gets lost, and the practical implications. And he wants us to chew on the trade-off between aggressive and conservative summarization. There's also this wonderfully meta layer of an AI summarizing its own conversation history so it can keep talking to itself. So where do we even start with this?
Herman
I want to start with the thing most people get wrong about compaction, which is that they think it's just one thing. It's actually a three-tier system. Layer one is tool result trimming — just replacing old tool outputs with placeholder text. Layer two is cache-friendly prefix preservation. Layer three is the actual LLM-generated summary. Most auto-compactions never even reach layer three. They get handled by what's called Session Memory Compact, which uses structured information already sitting in session memory. No extra model call needed.
Corn
The system tries cheaper options first before it spins up a whole separate inference. That's actually sensible engineering. What triggers it in the first place?
Herman
Manual is straightforward — you type slash compact, or even slash co as a shortcut, and it fires immediately. The whole conversation history gets sent off to a summarization call, a summary comes back, and a compact boundary gets inserted. Everything before that boundary is retained in memory but no longer included in future prompts. Only the summary moves forward.
Herman
Auto-compact kicks in at roughly ninety-five percent context capacity. The exact threshold is the effective context window minus about thirteen thousand tokens. There's also a passive fallback — if the API returns a prompt-too-long error, the system initiates reactive compression and retries. But here's the thing — if auto-compaction fails three times in a row, it pauses to prevent an infinite loop. There's a known bug where the internal compaction process can hang indefinitely and just burn through your quota. The only fix is a manual interrupt.
Corn
Relying on auto-compact is a bad idea.
Herman
Steve Kinney, who teaches a course on AI development tools, put it bluntly — users should not rely on auto-compact because it can cause the agent to lose important context and spiral out of control. Manual compaction with focused instructions is much safer.
Corn
Which brings us to the separate model call. This is where it gets interesting. When compaction does require an LLM summary, it's an additional sampling step. It counts against your rate limits and your bill. The API returns detailed usage broken into compaction iterations and message iterations separately.
Herman
This is something I didn't realize until I dug into the documentation — in server-side compaction through the Messages API, the same model you specified for your request gets used for the summarization. There's no option to use a cheaper model for the summary. But the Claude Agent Python SDK does allow specifying a different model for compaction through its compaction control parameter.
Corn
So if you're using the API directly, you're paying full price for the summary call. If you're using the SDK, you could theoretically point compaction at a lighter model. Though I'd worry about summary quality dropping if you cheap out there.
Herman
That's the core tension of the whole feature. The structured summarization prompt is where the real engineering sophistication lives. In Claude Code specifically, the prompt demands a nine-section structured summary. It's not just saying summarize this conversation. It's asking the model to capture user intent as a direct quote, core technical concepts, files and code of interest, errors encountered and how they were fixed, the problem-solving logic chain, a summary of all user messages, TODO items, what's currently being worked on, and suggested next steps.
Corn
That's remarkably specific. And it demands direct quotes rather than paraphrasing?
Herman
That's a deliberate design choice to prevent what they call context drift. If the model paraphrases, subtle meaning shifts can accumulate over multiple compactions. By demanding direct quotes of key phrases, the summary stays anchored to the original language. It's like the difference between taking verbatim notes in a meeting versus writing a summary from memory three hours later.
Corn
Though even verbatim quotes pulled out of context can mislead. If I quote you saying this is a disaster but drop the preceding three sentences where you were describing someone else's code, the summary now has you panicking about your own work.
Herman
That's the lossy compression problem in a nutshell. And it's why the API-level default prompt is much simpler than Claude Code's nine-section version. The default prompt just says you have written a partial transcript, please write a summary to provide continuity, wrap it in summary tags. Claude Code's version is heavily customized for software development workflows.
Corn
Let's talk about what actually happens in memory when compaction fires. The transcript replacement.
Herman
In Claude Code, the system inserts what's called a compact boundary. Everything before that boundary is still stored — it's not deleted — but it's no longer included in prompts sent to the model. Only the synthetic summary message moves forward. At the API level, it creates a compaction block containing the summary. On all subsequent requests, every message block prior to that compaction block gets automatically dropped.
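In sketch form (data shapes are illustrative):

```python
# The full history stays stored; only the summary plus messages after the
# boundary are included in future prompts.
def prompt_view(history, boundary_index, summary_msg):
    return [summary_msg] + history[boundary_index:]
```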
Corn
Then there's a reconstruction phase after compaction.
Herman
Yes, and this part is crucial. After compaction, Claude Code does three things. First, it injects a lead-in message that says this session continues from a previous conversation, here's a summary. Second, it automatically re-reads recently edited files — up to five files, with a total budget of fifty thousand tokens and five thousand tokens per file. Third, it re-declares all tool and skill definitions.
Corn
Wait, it re-reads files automatically? That's the system compensating for what it knows it just lost. It's effectively saying I no longer have the exact contents of these files in my context, so let me proactively pull them back in.
Herman
And that's the tell — if you see Claude Code suddenly re-reading files after a compaction event, that's not a bug. It's the reconstruction phase doing its job. The system knows that verbatim file contents were among the first casualties of summarization.
Corn
Let's get systematic about what survives and what dies. I want to walk through both lists.
Herman
What's preserved — CLAUDE.md and CLAUDE.local.md. These are critical, and they survive because they don't live in the conversation history at all. They're loaded as part of the system prompt at the start of every session. Compaction never touches them. Same goes for system prompts, tool definitions, MCP instructions, and working directory state — these all get re-declared after compaction. The current task and its immediate context survive. Recently modified file names, recent errors and their solutions, general project architecture — all preserved.
Corn
Tool call structures survive — the fact that a search happened — but the actual results get replaced with placeholder text that says old tool result content cleared.
Herman
Now what gets lost. This is the list that bites people. Instructions from the start of the session — don't touch this file, use this format. Intermediate decisions — why you chose approach A over B. Specific code snippets discussed fifty messages ago. Subtle style rules — no emoji, no Co-Authored-By in commits.
Corn
The pattern here is that compaction reliably preserves what to do next but systematically drops why we did what we did.
Herman
That's the asymmetry. Decision context is the first casualty. The agent can continue executing correctly but loses the ability to explain its own reasoning or adapt when changing requirements invalidate earlier assumptions. The practical consequence — if you ask why did we do X a hundred messages in, you might get a confident-sounding but completely fabricated rationale.
Corn
Which is terrifying if you're using this for anything where audit trails matter. Now, the token savings are dramatic when it works. I saw a benchmark where processing five support tickets with thirty-five tool calls went from over two hundred thousand tokens down to about eighty-six thousand — a fifty-eight percent reduction with just two compaction events.
Herman
That's from the official Anthropic cookbook. Two hundred eight thousand eight hundred thirty-eight tokens down to eighty-six thousand four hundred forty-six. Two compaction events. Fifty-eight point six percent reduction. Those numbers matter because token count is directly tied to both cost and latency. Every token you save is money and time.
Corn
Yet there's this persistent advice from power users — never rely on compaction for critical rules. Everything the agent must always remember should live in CLAUDE.md.
Herman
That's from a detailed practitioner analysis by someone who's been using Claude Code heavily. The CLAUDE.md rule is the single most important piece of advice for anyone using this tool seriously. CLAUDE.md loads as part of the system prompt. It's outside the conversation history entirely. Compaction doesn't know it exists, doesn't touch it, can't summarize it away. If there's a formatting rule or a file you must never modify or a commit message convention — put it in CLAUDE.md.
Corn
The power user playbook is essentially — treat the conversation as volatile working memory and CLAUDE.md as persistent storage. Which is exactly how you'd design a system with limited context if you were being deliberate about it.
Herman
There's another layer power users exploit — the compactPrompt setting. You can override the default compaction prompt entirely. There's a setting in the Claude config file where you can specify custom instructions. One power user shared their prompt — preserve all rules from CLAUDE.md verbatim, keep file paths, error messages, and architectural decisions, summarize tool outputs but keep their conclusions.
Corn
Because it directly trades token savings for fidelity. If you tell the summarizer to preserve more detail, you get less compression. The whole point of compaction is to free up context, but if your custom prompt demands aggressive preservation, you're eating into those savings.
Herman
And a custom compactPrompt completely replaces the default prompt — it doesn't supplement it. So if you write a bad custom prompt, you've thrown away all that careful nine-section engineering and you might get worse results. There's real risk there.
Corn
Let's talk about the meta layer, because this is where it gets philosophically interesting. An AI is writing a summary of its own conversation so it can keep talking to itself. The nine-section structured prompt is the model constructing its own memory. It's a form of synthetic episodic memory.
Herman
This is the part I find genuinely fascinating. Is the post-compaction agent the same agent? Or is it a new instance reading a briefing about what a previous instance did? Codex CLI actually calls its compaction output a handoff memo, which makes the discontinuity explicit. Claude Code's approach is more like — no, this is still the same conversation, we're just compressing the history.
Corn
It's not the same conversation. The verbatim history is gone from the prompt. What remains is a summary written by the model about itself. If I summarize my own memories and then only consult the summary going forward, am I the same person who had those experiences? There's a real question of identity continuity here.
Herman
It gets weirder with multiple compactions. You compact once — now you have a summary of the original conversation. You keep working, the context fills up again, you compact a second time. Now the new summary is summarizing the first summary plus the new conversation. Information that survived the first compaction might get dropped in the second. It's summaries of summaries, like a game of telephone with yourself.
Corn
The telephone game analogy is apt. Each generation introduces its own compression artifacts. Direct quotes from the original conversation become paraphrases of paraphrases. The nine-section structure probably helps here because it forces the model to maintain those categories across compactions, but you're still losing fidelity with each cycle.
Herman
There's another angle I want to hit — the cache-aware design. Claude Code deliberately preserves message prefix stability to maximize prompt cache hit rates. This is a differentiator compared to Codex CLI and OpenCode. The trade-off is that Claude Code is less aggressive at freeing context, but the cost savings from cache hits compound over long sessions.
Corn
Explain why prefix stability matters for caching.
Herman
Prompt caching works by identifying repeated prefixes across requests. If the beginning of your prompt is identical to a previous request, the cached version gets used and you're not charged for those tokens. Claude Code's compaction preserves the early parts of the conversation structure — system prompts, tool definitions, CLAUDE.md content — in a stable order so that even after compaction, the prefix matches what was cached. You lose some compression aggressiveness but you gain on cache hit savings. It's a clever piece of systems engineering that most users never see.
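A sketch of that ordering principle, with a hypothetical assembly function:

```python
# Keep the cacheable prefix byte-identical across requests so the prompt
# cache keeps hitting even after compaction rewrites the tail.
def build_prompt(system_prompt, tool_defs, claude_md, summary, recent_msgs):
    stable_prefix = [system_prompt, *tool_defs, claude_md]  # identical every call
    volatile_tail = [summary, *recent_msgs]                 # changes as work proceeds
    return stable_prefix + volatile_tail
```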
Corn
OpenCode takes a completely different approach. Instead of physical deletion, it uses timestamp-based hiding. Messages get stamped as compacted and become invisible in subsequent requests but remain in the database. After summarization, it automatically replays the last user message so the agent's most recent memory stays on the user's latest instruction.
Herman
That replay mechanism is smart. One of the failure modes of compaction is that the agent loses the thread of what you just asked. By replaying the last user message, OpenCode ensures the most recent intent is always fresh. Claude Code's post-compaction reconstruction with the automatic file re-reads is trying to accomplish something similar but through a different mechanism.
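OpenCode's hide-don't-delete approach, sketched with illustrative data shapes:

```python
import time

# Nothing is deleted: compacted messages are stamped and filtered out of
# future prompts, and the last user message is replayed to stay fresh.
def mark_compacted(db_messages, upto_index):
    for msg in db_messages[:upto_index]:
        msg["compacted_at"] = time.time()

def visible_messages(db_messages, summary_msg, last_user_msg):
    active = [m for m in db_messages if m.get("compacted_at") is None]
    return [summary_msg, *active, last_user_msg]
```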
Corn
We've got three different tools with three different philosophies. Claude Code prioritizes cache efficiency and structured summaries. Codex CLI treats it explicitly as a handoff between agent instances. OpenCode preserves the full history in the database and just hides it from the active prompt.
Herman
They all converge on the same fundamental problem — context windows are finite, conversations grow unboundedly, and at some point you have to decide what to remember and what to forget. It's the exact same problem human beings face, just with token limits instead of neurons.
Corn
Let's get practical. If someone's listening and using Claude Code regularly, what should they actually do?
Herman
First, put everything critical in CLAUDE.md. Formatting rules, file exclusions, commit message conventions, architectural constraints — if the agent must always know it, it goes in CLAUDE.md. Never rely on telling the agent something in conversation and hoping compaction preserves it.
Corn
Second, use manual compaction with focused instructions. The slash compact command accepts custom guidance — you can type slash compact focus on API changes, and the summary will be biased toward what you care about. This is much safer than waiting for auto-compact to fire at ninety-five percent capacity when you might not be paying attention.
Herman
Third, if you're really serious, customize your compactPrompt. The default nine-section prompt is good, but your workflow might have specific needs. If you're doing a lot of refactoring across many files, you might want to emphasize file paths and architectural decisions. If you're debugging, you might want to preserve error messages and the chain of hypotheses you've tested.
Corn
Though remember the trade-off — the more you tell it to preserve, the less effective the compaction is at freeing tokens. You're making a conscious choice about what matters more for your session.
Herman
Fourth, watch for the signs that compaction just happened. You'll see a context compacted indicator in the terminal if you're watching. There might be a sudden cost spike in the token counter. And if the agent starts re-asking about things you already discussed, compaction probably fired and lost something important.
Corn
The re-asking thing is the most frustrating failure mode. You spend twenty messages debugging something, compaction fires, and suddenly the agent is suggesting approaches you already tried and rejected. You lose all that hard-won negative knowledge.
Herman
Which is why some power users maintain external TODO.md or NOTES.md files that they update manually or through a pre-compact hook. There's actually a feature request on GitHub for a pre-compact hook that would auto-generate a session summary file before compaction fires, so you'd have a persistent record outside the conversation. The workaround exists today if you're willing to set it up.
Corn
That GitHub issue number is seared into my brain at this point. Issue six thousand nine hundred seven. The pre-compact hook workaround is clever — you can essentially create your own external memory that survives any number of compactions.
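A sketch of that external-memory workaround — the hook wiring is an assumption (event names and invocation vary), but the pattern is just appending a snapshot to a durable file before compaction fires:

```python
#!/usr/bin/env python3
"""Illustrative pre-compact hook: append a snapshot to NOTES.md so a
durable record survives any number of compactions. Treat the hook
mechanism itself as hypothetical, not a confirmed Claude Code API."""
import datetime
import pathlib
import sys

stamp = datetime.datetime.now().isoformat(timespec="seconds")
payload = sys.stdin.read()  # whatever context the hook passes in, if any
with pathlib.Path("NOTES.md").open("a") as f:
    f.write(f"\n## Pre-compact snapshot {stamp}\n{payload}\n")
```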
Herman
That's really the meta-lesson here. The most effective users treat the conversation as ephemeral working memory and build their own persistent storage layer. CLAUDE.md is one layer. External markdown files are another. Git commit messages are a third. The conversation is where the thinking happens, but the durable record lives elsewhere.
Corn
Which is honestly good practice for any kind of work, AI-assisted or not. Your chat log shouldn't be your documentation.
Herman
There's one more practical implication I want to flag — the API-level compaction beta. If you're building applications on top of Claude rather than using Claude Code directly, you can enable compaction via a beta header. You set the trigger threshold — default is a hundred fifty thousand tokens, minimum is fifty thousand. And there's a pause after compaction parameter that lets you preserve recent messages verbatim after the compaction block. That's useful if you want the summary to cover the older history but keep the last few turns intact.
Corn
You can create a sliding window of verbatim history plus a summary of everything older. That's actually a really nice hybrid approach.
Herman
And the usage tracking is transparent — the API returns a usage dot iterations array that breaks out compaction iterations separately from message iterations, so you can see exactly what you're paying for.
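For builders, a heavily hedged sketch of enabling it — the beta header value and parameter names below are placeholders, not confirmed API fields:

```python
import anthropic

client = anthropic.Anthropic()
response = client.beta.messages.create(
    model="claude-sonnet-4-5",          # any current model
    max_tokens=1024,
    messages=[{"role": "user", "content": "continue the refactor"}],
    extra_headers={"anthropic-beta": "compaction-YYYY-MM-DD"},  # placeholder
    extra_body={
        "compaction": {                  # illustrative shape
            "trigger_tokens": 150_000,   # default per the episode; min 50,000
            "keep_recent_turns": 4,      # the "pause after compaction" idea
        }
    },
)
```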
Corn
Let's circle back to the meta layer, because I keep thinking about the philosophical implications. We're talking about an AI that writes structured summaries of its own experiences to maintain continuity of identity across context window boundaries. That's not just an engineering trick. That's a primitive form of autobiographical memory.
Herman
I think autobiographical memory is exactly the right framing. Human autobiographical memory isn't a verbatim recording either — it's a constantly re-summarized narrative that we update as we go. We don't remember every detail of every conversation. We remember the gist, the key decisions, the emotional highs and lows, the lessons learned. Compaction is doing the same thing.
Corn
Except the AI's summaries are driven by a structured prompt that tells it what categories matter. Human memory has its own salience filters — we remember things that were surprising, emotionally charged, or relevant to our goals. The AI's filter is whatever the prompt engineer decided was important.
Herman
Which raises the question — what happens when the prompt engineer's priorities don't match what actually turns out to be important later? You compact, you preserve what the nine-section template says to preserve, and then three hours later you realize the crucial detail was in section none of the above.
Corn
That's the fundamental tension of any compression system. Lossless compression preserves everything but saves less space. Lossy compression saves more space but might drop the one pixel that matters. The nine-section prompt is a bet about what usually matters in software development conversations. It's probably right most of the time. When it's wrong, the failure is invisible until you need the thing that got dropped.
Herman
That's why the power user advice keeps circling back to the same thing — don't let the system decide what to remember. Put it in CLAUDE.md. Write it to a file. Make it part of the persistent record. The compaction system is a convenience, not a guarantee.
Corn
By the way, fun fact — today's episode is being written by DeepSeek V four Pro.
Herman
Different model, same meta-problem. I wonder how it handles its own context summarization.
Corn
We'll have to ask it sometime. Alright, we should wrap the core discussion. Let me try to synthesize — compaction is a three-tier system that tries cheap options before expensive ones, uses a heavily structured nine-section prompt to prevent context drift, and creates a fascinating philosophical question about agent identity continuity. The practical advice is straightforward but people keep ignoring it — CLAUDE.md for everything persistent, manual compaction with focused instructions, and never trust auto-compact with anything you care about.
Herman
The trade-off that Daniel wanted us to chew on — aggressive summarization saves context but risks losing load-bearing detail. Conservative summarization preserves fidelity but wastes tokens. There's no universal right answer. It depends on your session, your priorities, and how much you trust the summarizer to know what matters.
Corn
The meta-ness of an AI summarizing its own conversation history to keep talking to itself — it's either the most elegant solution to the context window problem or a house of cards waiting to collapse under its own assumptions.

And now: Hilbert's daily fun fact.
Herman
The average cumulus cloud weighs approximately one point one million pounds. The water droplets are spread across such a large volume that they float despite the enormous total mass.
Corn
What can listeners actually do with all this? First, if you're using Claude Code, set up your CLAUDE.md file today if you haven't already. Every formatting rule, every file exclusion, every architectural constraint — get it out of your conversation history and into persistent storage. Second, get in the habit of using manual compaction with focused instructions. Slash compact focus on whatever matters most right now is vastly better than waiting for the system to decide at ninety-five percent capacity. Third, if you notice the agent re-asking questions or suggesting approaches you already rejected, compaction probably fired and dropped something. Don't get frustrated — just re-provide the missing context and consider whether it belongs in CLAUDE.md.
Herman
Fourth, if you're building applications on the API, look into the compaction beta. The pause after compaction parameter gives you fine-grained control over what stays verbatim versus what gets summarized. Fifth, consider maintaining an external NOTES.md or TODO.md file for long sessions. The conversation is working memory. Files are long-term storage. Treat them accordingly. And finally, if you're going to customize your compactPrompt, test it. Run a session, compact, and then ask the agent to explain the reasoning behind an early decision. If it can't, your custom prompt is dropping too much.
Corn
One forward-looking thought — as context windows keep growing, the pressure to compact might seem like it decreases. But conversations grow to fill available context, and the fundamental problem doesn't go away. If anything, longer conversations make the summarization problem harder because there's more history to compress and more opportunity for important details to get lost in the noise. The techniques that work today — structured prompts, persistent external memory, manual control over what gets preserved — are going to matter even more as sessions get longer.
Herman
Thanks to Hilbert Flumingtop for producing, as always. This has been My Weird Prompts, episode two thousand three hundred eighty-four. You can find every episode at myweirdprompts.com. We'll be back with another one soon.
Corn
Take care, everyone.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.