Compaction is the hidden step where LLM apps compress earlier context to fit the context window. Learn where it happens and how to verify what was kept.
A typical LLM session does not simply “fill up” and fail. In many real products, it quietly switches strategies: when the conversation grows too large, the system automatically compresses older messages into a compact representation so the model can keep going. That process is now commonly exposed to developers and power users under names like “compaction” in long-context toolchains and “auto-compact” behavior in certain chat/agent workflows. Anthropic, for example, documents compaction as an automatic summarization mechanism that triggers when the conversation approaches a configured token threshold. (Source)
This is why "LLM basics" can't stop at tokens and context windows. Beginners often learn the safe habit of verifying before believing, but they still miss a second failure mode: when compaction happens, the model may no longer be looking at your original evidence. Instead, it is looking at a summary of your evidence. That summary can be incomplete (dropped details), distorted (reframed language), or subtly wrong (a summary that introduces or misstates facts). Compaction reduces context-overflow errors, yet it shifts risk from "missing text" to "summary-based hallucinations," where the model confidently extends from what the summary claims.
So the real mental model becomes: tokens and context windows limit what the model can attend to, and compaction decides what portion of your prior tokens survives. In practice, compaction is a preprocessing step in the tokens-to-prompt pipeline, and it matters most when you are using the LLM for research or careful writing, where earlier details should remain inspectable. (Source)
Modern long-context models have pushed window sizes far beyond the early 8K/16K era. Google has described Gemini 1.5 Pro as supporting up to 1 million tokens in production. (Source) Anthropic has published context-window expansions as well, including a 100K-token step for Claude, which they describe as "around 75,000 words." (Source) And Anthropic's current API documentation discusses 1M-token context-window availability, with specific budgets depending on the channel. (Source)
Yet even with million-token windows, compaction remains valuable because prompts include more than just your text. Tool outputs, system instructions, conversation history, and sometimes hidden or internal overhead also consume tokens. OpenAI's API documentation for GPT-4o states a 128,000-token context window, reinforcing that most deployments still run inside hard budgets even when the numbers sound huge. (Source)
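To make that budget pressure concrete, here is a toy arithmetic sketch; every number in it is invented for illustration and does not reflect any particular deployment.

```python
# Illustrative prompt-budget arithmetic: overhead eats the window long
# before your own text does. All numbers below are made up.

window = 128_000               # model's context window, in tokens
system_instructions = 2_000    # app-level system prompt
tool_outputs = 30_000          # retrieved documents, search results, etc.
conversation_history = 90_000  # everything said so far

used = system_instructions + tool_outputs + conversation_history
print(window - used)  # 6000 tokens left for the next turn and the reply
```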
When budgets are finite, compaction becomes a default control system. The rest of this guide will treat compaction as a first-class concept, not an incidental detail.
Think of the pipeline like this:
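Here is a minimal sketch of that flow. The trigger value, the helper names (`count_tokens`, `summarize`), and the keep-the-last-few-turns rule are all illustrative assumptions, not any vendor's actual behavior.

```python
# Sketch of a tokens-to-prompt pipeline with a compaction step. The
# threshold, the helpers, and the "keep the last 4 turns" rule are
# assumptions for illustration only.

COMPACTION_TRIGGER = 0.8  # compact when ~80% of the window is used (assumed)

def build_prompt(messages, window_limit, count_tokens, summarize):
    total = sum(count_tokens(m) for m in messages)
    if total < window_limit * COMPACTION_TRIGGER:
        return messages  # everything still fits verbatim

    # Keep recent turns as-is; replace older ones with a summary block.
    older, recent = messages[:-4], messages[-4:]
    compaction_block = {
        "role": "system",
        "content": "Summary of earlier conversation:\n" + summarize(older),
    }
    # From here on, the model is conditioned on the summary, not on the
    # original messages it replaced.
    return [compaction_block] + recent
```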
Anthropic’s compaction documentation describes exactly this: it “extends the effective context length for long-running conversations and tasks” by automatically summarizing older context when approaching the window limit, producing a “compaction block” that the model then continues from. (Source)
This is the operational hinge where the beginner-to-intermediate safety lesson should land. It’s not merely that the model may “forget” when the context window is hit; it’s that an intermediate transformation step rewrites what counts as context.
Here’s the key analytical shift: compaction changes the evidence state, not just the evidence volume. With truncation, the model stops seeing earlier text entirely. With compaction, the model continues, using newly produced tokens that attempt to stand in for earlier tokens. That substitution can be faithful, but it can also be non-faithful in systematic ways: details can be dropped, attributions and qualifiers can be reworded, and the summarizer can introduce statements that never appeared in the original.
In other words, compaction doesn’t just “save space.” It creates a second-stage dependency: later outputs are conditioned on the model’s earlier compression judgment, not only on your original inputs. That is what makes the failure mode “summary-based hallucination” rather than plain truncation failure.
So what should you expect when compaction is active? Practically, you should assume that the model’s accessible context becomes a mixture: some tokens are still directly grounded in your original conversation, while others are model-generated replacements whose correctness you cannot infer from their confidence. The more your task depends on details that are easy to paraphrase (dates, numbers, attributions, exception clauses), the more valuable it is to treat compaction as a transformation you can audit—not a background implementation detail.
Tokens are the accounting unit; compaction is the operational response. Compaction is not a durable database of what you wrote. It is closer to a rewrite policy for what the model will see going forward.
That distinction matters when you run a verification loop. A verification loop should be built around the question: “Did the model verify the summary’s content, or did it verify the original materials?” Many users assume these are the same. Under compaction, they can diverge.
Also, compaction can interact with other “memory-like” features in confusing ways. For example, OpenAI describes ChatGPT “Memory” as a separate mechanism from the raw conversation context and provides controls to manage and delete saved memories. (Source) In contrast, compaction is typically session-level compression of earlier conversation context to keep the immediate prompt within limits. Beginners should not treat these systems as equivalent.
It’s tempting to frame compaction as a pure improvement: fewer “I can’t continue” moments, smoother long work sessions, and less repeated re-prompting. That is genuinely a benefit. Summarizing older context early can also help performance: you keep the model focused on the “shape” of the task rather than forcing it to re-read everything.
But there is an editorial price tag. Summaries are lossy. Even when they are accurate in spirit, they can change specificity. A dropped paragraph can remove evidence needed for a tight claim. A paraphrase can change who said what. A summary can also inadvertently become a generator of new statements if the summarizer model fills in gaps. In long-context workflows, “summarize then answer” can amplify that effect because the answer stage treats the summary as context truth.
Research on hallucination patterns in summarization is directly relevant here. In a multi-document summarization study, the authors report that on average “up to 75% of the content” in LLM-generated summaries was hallucinated, with hallucinations more likely toward the end of summaries. (Source) While this paper targets multi-document summarization rather than compaction summaries specifically, it illustrates a general risk: summaries can contain content not grounded in the input, especially as summaries get longer or more inferential.
To be clear, compaction summaries are not identical to “write me a marketing summary of three articles.” But both involve transforming content into a compact representation the next step will reuse. When summaries contain errors, the reuse step can turn those errors into confident outputs.
Even if compaction is high-quality, hallucinations are not purely a context-window phenomenon. OpenAI has argued that hallucinations occur partly because standard training and evaluation procedures reward guessing over acknowledging uncertainty. (Source) That means: if the system needs to continue and has incomplete grounding, it can still produce fluent content that “sounds right.” Under compaction, that incomplete grounding might come from missing details in the compacted representation.
So the practical stance is not “compaction is bad.” It is “compaction shifts what you must verify.” You verify the summary because the summary becomes the input to later reasoning.
Here’s a simple verification loop designed for beginner-to-intermediate users who want to reduce summary-based hallucinations without turning the workflow into bureaucracy.
Ask for an explicit recap of what the model is using. For example: "List every fact, number, date, and constraint you are currently relying on for this task, and mark each one as either quoted directly from my messages or reconstructed from a summary."
This doesn’t guarantee perfect honesty, but it creates a structured artifact you can audit. It also gives you something to compare against your source materials.
Then run a second pass: paste the relevant source passages back in and ask the model to confirm, correct, or flag each recap item against them.
This resembles established research directions for reducing hallucinations through verification-like prompting. One example is “Chain-of-Verification (CoVe),” which drafts a response, plans verification questions, answers them independently, and then generates a final verified response. (Source) Even if your implementation is manual rather than algorithmic, the core idea is to stop trusting the first draft as the ground truth.
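To see the shape of that loop in code, here is a hand-rolled sketch of the CoVe pattern. `ask()` is a placeholder for any function that sends a prompt to a model and returns text; the prompts are ours, not the paper's.

```python
# A manual sketch of the Chain-of-Verification pattern: draft, plan checks,
# answer the checks independently, then revise. `ask` stands in for any
# LLM call; it is not a real library API.

def chain_of_verification(question: str, ask) -> str:
    draft = ask(f"Answer the question:\n{question}")

    plan = ask(
        "List short fact-checking questions, one per line, that would test "
        f"each factual claim in this draft:\n{draft}"
    )

    # Answer each check on its own, without showing the draft, so the
    # verification is not anchored on the draft's wording.
    checks = [(q, ask(q)) for q in plan.splitlines() if q.strip()]

    return ask(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A: {checks}\n"
        "Rewrite the draft, correcting anything the checks contradict and "
        "explicitly labeling anything left unverified."
    )
```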
A third pass catches the most common compaction harm: dropped constraints. Ask something like: "Which earlier requirements, exceptions, or qualifiers are you least certain still apply, and why?"
You are asking the model to highlight fragility. A good answer often reveals what the summary lacks.
Even if you are careful, compaction can trigger before you “hit the limit.” Anthropic’s compaction description indicates it activates when approaching a configured token threshold, not only at the hard ceiling. (Source) That means your workflow should treat compaction as recurring, not exceptional. Build your verification loop so it can run every time you change a task phase (for instance, after you switch from summarizing to arguing, or after you request a new section in a draft).
Product behaviors vary, but compaction-like mechanisms appear in toolchains designed for long tasks. Below are documented examples where compaction is explicitly part of the system design or where long-horizon workflows require summaries to continue.
Anthropic documents compaction as a feature that automatically summarizes older context when approaching a token threshold, replacing it with a compaction block so the conversation can continue. (Source)
Outcome: longer-running sessions with less context overflow.
Timeline: Anthropic’s documentation describes an updated compaction header pattern, indicating the feature shipped into the Claude API workflow. (Source)
Practical lesson: when you depend on details earlier in the thread, you must verify claims against the original materials, because the model is no longer guaranteed to “see” those materials verbatim.
Anthropic’s “100K Context Windows” announcement highlights a move from 9K to 100K tokens, corresponding to around 75,000 words, and states these context windows were made available in their API. (Source)
Outcome: more content can be provided at once, but system-level prompt budgets still tend to tighten when you add tool outputs, instructions, and multi-turn history.
Timeline: announced in Anthropic’s May 2023 post introducing 100K context windows. (Source)
Practical lesson: even when the window is large enough to delay compaction, long-horizon workflows often still create secondary summaries (research notes, intermediate outlines, “what we decided” recaps). The risk you manage doesn’t disappear—it shifts from “implicit compaction only” to “implicit compaction plus user- or app-generated compression.”
Google has described Gemini 1.5 Pro as supporting up to 1 million tokens in production. (Source)
Outcome: users can include much more text per session, potentially delaying compaction triggers and reducing how often a system must replace earlier context with a condensed block.
Timeline: announced in Google’s February 2024 post. (Source)
Practical lesson: compaction-like transformations remain likely whenever applications produce “condensed briefing” artifacts internally, even if the base model’s window is huge. In other words, verification isn’t just for the hard threshold—it’s for any stage where the system produces a compressed representation that later steps treat as ground truth.
A multi-document summarization benchmark found that hallucination can dominate summary content, reporting “up to 75%” hallucinated content on average across evaluated models. (Source)
Outcome: it becomes plausible that compacted representations can carry forward inaccuracies.
Timeline: published as an arXiv preprint; the arXiv record gives the submission date. (Source)
Practical lesson: do not assume compacted summaries are inherently safer than long transcripts. Treat them as artifacts requiring verification.
For research and drafting, you want two properties from an LLM assistant: (1) stable grounding, and (2) explicit uncertainty when grounding is missing. Compaction affects both.
Before you ask for an argument or synthesis, ask for a structured extraction from your sources: for each source, list the claims you may rely on, the exact passage that supports each claim, and where that passage came from.
This approach encourages the model to build a claim ledger. Then you can ask for writing only after the ledger exists. You reduce the chance that compaction will erase the evidence needed for later prose.
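One possible shape for such a ledger, kept outside the conversation so compaction cannot rewrite it. The field names here are our own illustration, not a standard schema.

```python
# A claim ledger stored outside the chat session. Field names are
# illustrative; the point is that every claim carries verbatim evidence.

from dataclasses import dataclass

@dataclass
class ClaimRecord:
    claim: str      # the statement you intend to use in prose
    source: str     # where the claim came from
    quote: str      # exact supporting passage, copied verbatim
    verified: bool  # True only after checking against the original

ledger = [
    ClaimRecord(
        claim="Compaction triggers near a configured token threshold.",
        source="vendor documentation",
        quote="triggers when the conversation approaches a configured token threshold",
        verified=True,
    ),
]

# Draft only from verified entries; everything else stays labeled.
usable = [r for r in ledger if r.verified]
```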
If the assistant is likely to compact, you can ask it to re-summarize deliberately and then audit: have the model write its own compaction summary of the conversation so far, then review that summary against your notes before letting the session continue from it.
This converts compaction from a hidden operation into an explicit, reviewable step, so your verification loop has something tangible to check. Anthropic’s documentation frames compaction as an automatic summarization block; asking for a controlled “compaction summary” is a way to regain agency over that transformation. (Source)
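A crude way to audit such a summary is to check whether each piece of verbatim evidence survived. The substring matching below is deliberately naive, for illustration only; a real audit would need fuzzier comparison.

```python
# Flag ledger entries whose verbatim evidence no longer appears in a
# model-produced compaction summary. Naive substring matching.

def dropped_evidence(summary: str, ledger: list[dict]) -> list[dict]:
    return [entry for entry in ledger if entry["quote"] not in summary]

ledger = [
    {"claim": "The report covers Q3 2024 only.",
     "quote": "figures in this report are limited to Q3 2024"},
]
summary = "Earlier, we discussed a financial report and its main findings."

for entry in dropped_evidence(summary, ledger):
    print("Re-verify before relying on:", entry["claim"])
```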
OpenAI’s work on hallucinations emphasizes that incentives and evaluation matter, and that models can be pushed toward confident guessing. (Source) In your workflow, the practical translation is: ask for checks that force the model to tie claims to evidence and label unverified items.
One more relevant research direction: chain-of-verification reduces hallucinations across tasks in experiments. (Source) Your “verification loop” is basically a human-operated analog of that idea.
Compaction is becoming the practical default behavior in long-running LLM applications, because context budgets are real and because automatically summarizing earlier messages is the cheapest way to keep sessions alive. Anthropic explicitly documents compaction as an automatic summarization step triggered near thresholds. (Source) As long-context models scale up, compaction may trigger less often, but it will not disappear, because writing and analysis workflows still benefit from condensation.
By Q4 2026, teams should plan for verification-loop UX patterns—such as a visible “compacted summary” pane or an audit trail of retained context—to become more standard in professional tooling, particularly where compliance or editorial review matters. The technical reason is straightforward: compaction introduces a transformation step that can be surfaced and tested, and verification-like methods already have research support for reducing hallucinations. (Source) Compaction itself is documented and therefore can be operationalized in product design. (Source)
(Reality check: this is a planning forecast, not a guarantee. Model families and app designers will vary in whether they expose compaction explicitly, and some may prefer “shadow” summaries that never show an audit pane—so your policy should assume you may not get perfect visibility.)
Adopt a deployment policy with a single rule: no evidence-carrying claims may be taken from a compaction summary without a verification step tied to original sources. Concretely, require teams using LLM compaction to implement a two-stage workflow for research and writing: first, an extraction stage that builds a claim ledger tying every evidence-carrying claim to its original source passage; second, a verification stage in which any claim that reaches a draft via a compacted or summarized context is checked against the original material, with unverified items explicitly labeled.
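As a minimal sketch, that single rule can be expressed as a publish gate; the field names are assumptions for illustration, not a standard schema.

```python
# Single-rule policy as code: a claim that originated in a compaction
# summary may not ship until it has been verified against the original.
# Field names ("origin", "verified") are illustrative.

def may_publish(claim: dict) -> bool:
    from_summary = claim.get("origin") == "compaction_summary"
    return claim.get("verified", False) or not from_summary

assert may_publish({"origin": "original_source", "verified": False})
assert not may_publish({"origin": "compaction_summary", "verified": False})
assert may_publish({"origin": "compaction_summary", "verified": True})
```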
This policy directly targets the “summary-based hallucination” risk introduced by compaction. It also aligns with the broader understanding that hallucinations persist when systems are rewarded for confident guessing without uncertainty-aware evaluation. (Source)
The short version: compaction is not a trust shortcut. It is a convenience layer. Your job is to make sure the convenience layer never becomes the final author of your facts.