Compaction is the hidden step where LLM apps compress earlier context to fit the context window. Learn where it happens and how to verify what was kept.
A typical LLM session does not simply “fill up” and fail. In many real products, it quietly switches strategies: when the conversation grows too large, the system automatically compresses older messages into a compact representation so the model can keep going. That process is now commonly exposed to developers and power users under names like “compaction” in long-context toolchains and “auto-compact” behavior in certain chat/agent workflows. Anthropic, for example, documents compaction as an automatic summarization mechanism that triggers when the conversation approaches a configured token threshold. (Source)
This is why "LLM basics" can't stop at tokens and context windows. Beginners often learn the safe habit of verifying before believing, but they still miss a second failure mode: when compaction happens, the model may no longer be looking at your original evidence. Instead, it is looking at a summary of your evidence. That summary can be incomplete (dropped details), distorted (reframed language), or subtly wrong (a summary that introduces or misstates facts). Compaction reduces context-overflow errors, yet it shifts risk from "missing text" to "summary-based hallucinations," where the model confidently extends from what the summary claims.
So the real mental model becomes: tokens and context windows limit what the model can attend to, and compaction decides what portion of your prior tokens survives. In practice, compaction is a preprocessing step in the tokens-to-prompt pipeline, and it matters most when you are using the LLM for research or careful writing, where earlier details should remain inspectable. (Source)
Modern long-context models have pushed window sizes far beyond the early 8K/16K era. Google has described Gemini 1.5 Pro as supporting up to 1 million tokens in production. (Source) Anthropic has published context-window expansions as well, including a 100K-token step for Claude, which they describe as "around 75,000 words." (Source) And Anthropic's current API documentation discusses 1M-token context-window availability, with specific budgets depending on the channel. (Source)
Yet even with million-token windows, compaction remains valuable because prompts include more than just your text. Tool outputs, system instructions, conversation history, and sometimes hidden or internal overhead also consume tokens. OpenAI's API documentation for GPT-4o states a 128,000-token context window, reinforcing that most deployments still run inside hard budgets even when the numbers sound huge. (Source)
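To make that budget pressure concrete, here is a toy arithmetic sketch; every number in it is invented for illustration and does not reflect any particular deployment.

```python
# Illustrative prompt-budget arithmetic: overhead eats the window long
# before your own text does. All numbers below are made up.

window = 128_000               # model's context window, in tokens
system_instructions = 2_000    # app-level system prompt
tool_outputs = 30_000          # retrieved documents, search results, etc.
conversation_history = 90_000  # everything said so far

used = system_instructions + tool_outputs + conversation_history
print(window - used)  # 6000 tokens left for the next turn and the reply
```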
When budgets are finite, compaction becomes a default control system. The rest of this guide will treat compaction as a first-class concept, not an incidental detail.
Think of the pipeline like this:
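Here is a minimal sketch of that flow. The trigger value, the helper names (`count_tokens`, `summarize`), and the keep-the-last-few-turns rule are all illustrative assumptions, not any vendor's actual behavior.

```python
# Sketch of a tokens-to-prompt pipeline with a compaction step. The
# threshold, the helpers, and the "keep the last 4 turns" rule are
# assumptions for illustration only.

COMPACTION_TRIGGER = 0.8  # compact when ~80% of the window is used (assumed)

def build_prompt(messages, window_limit, count_tokens, summarize):
    total = sum(count_tokens(m) for m in messages)
    if total < window_limit * COMPACTION_TRIGGER:
        return messages  # everything still fits verbatim

    # Keep recent turns as-is; replace older ones with a summary block.
    older, recent = messages[:-4], messages[-4:]
    compaction_block = {
        "role": "system",
        "content": "Summary of earlier conversation:\n" + summarize(older),
    }
    # From here on, the model is conditioned on the summary, not on the
    # original messages it replaced.
    return [compaction_block] + recent
```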
Anthropic’s compaction documentation describes exactly this: it “extends the effective context length for long-running conversations and tasks” by automatically summarizing older context when approaching the window limit, producing a “compaction block” that the model then continues from. (Source)
This is the operational hinge where the beginner-to-intermediate safety lesson should land. It’s not merely that the model may “forget” when the context window is hit; it’s that an intermediate transformation step rewrites what counts as context.
Here’s the key analytical shift: compaction changes the evidence state, not just the evidence volume. With truncation, the model stops seeing earlier text entirely. With compaction, the model continues, using newly produced tokens that attempt to stand in for earlier tokens. That substitution can be faithful, but it can also be non-faithful in systematic ways: details can be dropped, attributions and qualifiers can be reworded, and the summarizer can introduce statements that never appeared in the original.
In other words, compaction doesn’t just “save space.” It creates a second-stage dependency: later outputs are conditioned on the model’s earlier compression judgment, not only on your original inputs. That is what makes the failure mode “summary-based hallucination” rather than plain truncation failure.
So what should you expect when compaction is active? Practically, you should assume that the model’s accessible context becomes a mixture: some tokens are still directly grounded in your original conversation, while others are model-generated replacements whose correctness you cannot infer from their confidence. The more your task depends on details that are easy to paraphrase (dates, numbers, attributions, exception clauses), the more valuable it is to treat compaction as a transformation you can audit—not a background implementation detail.
Tokens are the accounting unit; compaction is the operational response. Compaction is not a durable database of what you wrote. It is closer to a rewrite policy for what the model will see going forward.
That distinction matters when you run a verification loop. A verification loop should be built around the question: “Did the model verify the summary’s content, or did it verify the original materials?” Many users assume these are the same. Under compaction, they can diverge.
Also, compaction can interact with other “memory-like” features in confusing ways. For example, OpenAI describes ChatGPT “Memory” as a separate mechanism from the raw conversation context and provides controls to manage and delete saved memories. (Source) In contrast, compaction is typically session-level compression of earlier conversation context to keep the immediate prompt within limits. Beginners should not treat these systems as equivalent.
It’s tempting to frame compaction as a pure improvement: fewer “I can’t continue” moments, smoother long work sessions, and less repeated re-prompting. That is genuinely a benefit. Summarizing older context early can also help performance: you keep the model focused on the “shape” of the task rather than forcing it to re-read everything.
But there is an editorial price tag. Summaries are lossy. Even when they are accurate in spirit, they can change specificity. A dropped paragraph can remove evidence needed for a tight claim. A paraphrase can change who said what. A summary can also inadvertently become a generator of new statements if the summarizer model fills in gaps. In long-context workflows, “summarize then answer” can amplify that effect because the answer stage treats the summary as context truth.
Research on hallucination patterns in summarization is directly relevant here. In a multi-document summarization study, the authors report that on average “up to 75% of the content” in LLM-generated summaries was hallucinated, with hallucinations more likely toward the end of summaries. (Source) While this paper targets multi-document summarization rather than compaction summaries specifically, it illustrates a general risk: summaries can contain content not grounded in the input, especially as summaries get longer or more inferential.
To be clear, compaction summaries are not identical to “write me a marketing summary of three articles.” But both involve transforming content into a compact representation the next step will reuse. When summaries contain errors, the reuse step can turn those errors into confident outputs.
Even if compaction is high-quality, hallucinations are not purely a context-window phenomenon. OpenAI has argued that hallucinations occur partly because standard training and evaluation procedures reward guessing over acknowledging uncertainty. (Source) That means: if the system needs to continue and has incomplete grounding, it can still produce fluent content that “sounds right.” Under compaction, that incomplete grounding might come from missing details in the compacted representation.
So the practical stance is not “compaction is bad.” It is “compaction shifts what you must verify.” You verify the summary because the summary becomes the input to later reasoning.
Here’s a simple verification loop designed for beginner-to-intermediate users who want to reduce summary-based hallucinations without turning the workflow into bureaucracy.
Ask for an explicit recap of what the model is using. For example: "List every fact, number, date, and constraint you are currently relying on for this task, and mark each one as either quoted directly from my messages or reconstructed from a summary."
This doesn’t guarantee perfect honesty, but it creates a structured artifact you can audit. It also gives you something to compare against your source materials.
Then run a second pass: paste the relevant source passages back in and ask the model to confirm, correct, or flag each recap item against them.
This resembles established research directions for reducing hallucinations through verification-like prompting. One example is “Chain-of-Verification (CoVe),” which drafts a response, plans verification questions, answers them independently, and then generates a final verified response. (Source) Even if your implementation is manual rather than algorithmic, the core idea is to stop trusting the first draft as the ground truth.
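To see the shape of that loop in code, here is a hand-rolled sketch of the CoVe pattern. `ask()` is a placeholder for any function that sends a prompt to a model and returns text; the prompts are ours, not the paper's.

```python
# A manual sketch of the Chain-of-Verification pattern: draft, plan checks,
# answer the checks independently, then revise. `ask` stands in for any
# LLM call; it is not a real library API.

def chain_of_verification(question: str, ask) -> str:
    draft = ask(f"Answer the question:\n{question}")

    plan = ask(
        "List short fact-checking questions, one per line, that would test "
        f"each factual claim in this draft:\n{draft}"
    )

    # Answer each check on its own, without showing the draft, so the
    # verification is not anchored on the draft's wording.
    checks = [(q, ask(q)) for q in plan.splitlines() if q.strip()]

    return ask(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A: {checks}\n"
        "Rewrite the draft, correcting anything the checks contradict and "
        "explicitly labeling anything left unverified."
    )
```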
A third pass catches the most common compaction harm: dropped constraints. Ask something like: "Which earlier requirements, exceptions, or qualifiers are you least certain still apply, and why?"
You are asking the model to highlight fragility. A good answer often reveals what the summary lacks.
Even if you are careful, compaction can trigger before you “hit the limit.” Anthropic’s compaction description indicates it activates when approaching a configured token threshold, not only at the hard ceiling. (Source) That means your workflow should treat compaction as recurring, not exceptional. Build your verification loop so it can run every time you change a task phase (for instance, after you switch from summarizing to arguing, or after you request a new section in a draft).
Product behaviors vary, but compaction-like mechanisms appear in toolchains designed for long tasks. Below are documented examples where compaction is explicitly part of the system design or where long-horizon workflows require summaries to continue.
Anthropic documents compaction as a feature that automatically summarizes older context when approaching a token threshold, replacing it with a compaction block so the conversation can continue. (Source)
Outcome: longer-running sessions with less context overflow.
Timeline: Anthropic’s documentation describes an updated compaction header pattern, indicating the feature shipped into the Claude API workflow. (Source)
Practical lesson: when you depend on details earlier in the thread, you must verify claims against the original materials, because the model is no longer guaranteed to “see” those materials verbatim.
Anthropic’s “100K Context Windows” announcement highlights a move from 9K to 100K tokens, corresponding to around 75,000 words, and states these context windows were made available in their API. (Source)
Outcome: more content can be provided at once, but system-level prompt budgets still tend to tighten when you add tool outputs, instructions, and multi-turn history.
Timeline: announced in Anthropic’s May 2023 post introducing 100K context windows. (Source)
Practical lesson: even when the window is large enough to delay compaction, long-horizon workflows often still create secondary summaries (research notes, intermediate outlines, “what we decided” recaps). The risk you manage doesn’t disappear—it shifts from “implicit compaction only” to “implicit compaction plus user- or app-generated compression.”
Google has described Gemini 1.5 Pro as supporting up to 1 million tokens in production. (Source)
Outcome: users can include much more text per session, potentially delaying compaction triggers and reducing how often a system must replace earlier context with a condensed block.
Timeline: announced in Google’s February 2024 post. (Source)
Practical lesson: compaction-like transformations remain likely whenever applications produce “condensed briefing” artifacts internally, even if the base model’s window is huge. In other words, verification isn’t just for the hard threshold—it’s for any stage where the system produces a compressed representation that later steps treat as ground truth.
A multi-document summarization benchmark found that hallucination can dominate summary content, reporting “up to 75%” hallucinated content on average across evaluated models. (Source)
Outcome: it becomes plausible that compacted representations can carry forward inaccuracies.
Timeline: published as an arXiv preprint; the arXiv record gives the submission date. (Source)
Practical lesson: do not assume compacted summaries are inherently safer than long transcripts. Treat them as artifacts requiring verification.
For research and drafting, you want two properties from an LLM assistant: (1) stable grounding, and (2) explicit uncertainty when grounding is missing. Compaction affects both.
Before you ask for an argument or synthesis, ask for a structured extraction from your sources: for each source, list the claims you may rely on, the exact passage that supports each claim, and where that passage came from.
This approach encourages the model to build a claim ledger. Then you can ask for writing only after the ledger exists. You reduce the chance that compaction will erase the evidence needed for later prose.
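One possible shape for such a ledger, kept outside the conversation so compaction cannot rewrite it. The field names here are our own illustration, not a standard schema.

```python
# A claim ledger stored outside the chat session. Field names are
# illustrative; the point is that every claim carries verbatim evidence.

from dataclasses import dataclass

@dataclass
class ClaimRecord:
    claim: str      # the statement you intend to use in prose
    source: str     # where the claim came from
    quote: str      # exact supporting passage, copied verbatim
    verified: bool  # True only after checking against the original

ledger = [
    ClaimRecord(
        claim="Compaction triggers near a configured token threshold.",
        source="vendor documentation",
        quote="triggers when the conversation approaches a configured token threshold",
        verified=True,
    ),
]

# Draft only from verified entries; everything else stays labeled.
usable = [r for r in ledger if r.verified]
```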
If the assistant is likely to compact, you can ask it to re-summarize deliberately and then audit: have the model write its own compaction summary of the conversation so far, then review that summary against your notes before letting the session continue from it.
This converts compaction from a hidden operation into an explicit, reviewable step, so your verification loop has something tangible to check. Anthropic’s documentation frames compaction as an automatic summarization block; asking for a controlled “compaction summary” is a way to regain agency over that transformation. (Source)
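A crude way to audit such a summary is to check whether each piece of verbatim evidence survived. The substring matching below is deliberately naive, for illustration only; a real audit would need fuzzier comparison.

```python
# Flag ledger entries whose verbatim evidence no longer appears in a
# model-produced compaction summary. Naive substring matching.

def dropped_evidence(summary: str, ledger: list[dict]) -> list[dict]:
    return [entry for entry in ledger if entry["quote"] not in summary]

ledger = [
    {"claim": "The report covers Q3 2024 only.",
     "quote": "figures in this report are limited to Q3 2024"},
]
summary = "Earlier, we discussed a financial report and its main findings."

for entry in dropped_evidence(summary, ledger):
    print("Re-verify before relying on:", entry["claim"])
```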
OpenAI’s work on hallucinations emphasizes that incentives and evaluation matter, and that models can be pushed toward confident guessing. (Source) In your workflow, the practical translation is: ask for checks that force the model to tie claims to evidence and label unverified items.
One more relevant research direction: chain-of-verification reduces hallucinations across tasks in experiments. (Source) Your “verification loop” is basically a human-operated analog of that idea.
Compaction is becoming the practical default behavior in long-running LLM applications, because context budgets are real and because automatically summarizing earlier messages is the cheapest way to keep sessions alive. Anthropic explicitly documents compaction as an automatic summarization step triggered near thresholds. (Source) As long-context models scale up, compaction may trigger less often, but it will not disappear, because writing and analysis workflows still benefit from condensation.
By Q4 2026, teams should plan for verification-loop UX patterns—such as a visible “compacted summary” pane or an audit trail of retained context—to become more standard in professional tooling, particularly where compliance or editorial review matters. The technical reason is straightforward: compaction introduces a transformation step that can be surfaced and tested, and verification-like methods already have research support for reducing hallucinations. (Source) Compaction itself is documented and therefore can be operationalized in product design. (Source)
(Reality check: this is a planning forecast, not a guarantee. Model families and app designers will vary in whether they expose compaction explicitly, and some may prefer “shadow” summaries that never show an audit pane—so your policy should assume you may not get perfect visibility.)
Adopt a deployment policy with a single rule: no evidence-carrying claims may be taken from a compaction summary without a verification step tied to original sources. Concretely, require teams using LLM compaction to implement a two-stage workflow for research and writing: first, an extraction stage that builds a claim ledger tying every evidence-carrying claim to its original source passage; second, a verification stage in which any claim that reaches a draft via a compacted or summarized context is checked against the original material, with unverified items explicitly labeled.
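As a minimal sketch, that single rule can be expressed as a publish gate; the field names are assumptions for illustration, not a standard schema.

```python
# Single-rule policy as code: a claim that originated in a compaction
# summary may not ship until it has been verified against the original.
# Field names ("origin", "verified") are illustrative.

def may_publish(claim: dict) -> bool:
    from_summary = claim.get("origin") == "compaction_summary"
    return claim.get("verified", False) or not from_summary

assert may_publish({"origin": "original_source", "verified": False})
assert not may_publish({"origin": "compaction_summary", "verified": False})
assert may_publish({"origin": "compaction_summary", "verified": True})
```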
This policy directly targets the “summary-based hallucination” risk introduced by compaction. It also aligns with the broader understanding that hallucinations persist when systems are rewarded for confident guessing without uncertainty-aware evaluation. (Source)
The short version: compaction is not a trust shortcut. It is a convenience layer. Your job is to make sure the convenience layer never becomes the final author of your facts.