PULSE.

Multilingual editorial — AI-curated intelligence on tech, business & the world.

© 2026 Pulse Latellu. All rights reserved.

AI-generated. Made by Latellu

All content is AI-generated and may contain inaccuracies. Please verify independently.

AI & Machine Learning · March 25, 2026 · 14 min read

One Million Tokens in GPT-5.4: Selective Recall, Governance Deadlines, and When RAG Still Wins

A one million token context window changes how enterprises retrieve knowledge, but it does not remove the need for RAG, governance, and evaluation discipline.

Sources

  • nist.gov
  • govinfo.gov
  • digital-strategy.ec.europa.eu
  • oecd.org
  • oecd.org
  • transparency.oecd.ai
  • oecd.ai
  • oecd.ai
  • iso.org
  • owasp.org
  • owasp.org
  • genai.owasp.org
  • airc.nist.gov
  • nist.gov

In This Article

  • One Million Tokens in GPT-5.4: Selective Recall, Governance Deadlines, and When RAG Still Wins
  • One million tokens is a ceiling
  • RAG still matters even with long context
  • Document governance becomes a prompt problem
  • Selective recall needs a better chunking model
  • Compaction, caching, and cost tradeoffs that hold
  • What breaks as you approach the context limit
  • Quantitative guardrails you can operationalize now
  • Real-world cases and what they imply
  • NIST AI RMF drives U.S. adoption
  • ISO IEC 23894 aligns via NIST crosswalk
  • EU expectations shape engineering controls
  • OWASP Top 10 guides eval and security testing
  • Selective recall policy, plus a timeline


One million tokens is a ceiling

The headline number is eye-catching: GPT-5.4-class models can accept a one million token context window. A “context window” is the maximum amount of text the model can read in a single request, and “one million tokens” is the upper bound of that input plus (depending on the system) relevant conversation state you choose to send. (Source).

For enterprise knowledge work, the real question isn’t whether a model can ingest a lot of text. It’s whether you can do it economically, reliably, and with auditable governance. Two teams can both claim “long context,” but one ships a system that’s fast, traceable, and safe, while the other delivers something slow, costly, and hard to audit. That gap is engineering practice.

A helpful way to frame this shift is from “retrieval as default” to “selective recall as default.” With long-context models, you can sometimes stage a large portion of your internal knowledge directly in the prompt. You still need a retrieval layer to determine what subset should be present, what should be redacted, and what must be referenced after the fact. This isn’t a philosophical change. It’s an architecture decision shaped by latency, cost, and compliance obligations. (For risk management and governance expectations, see NIST AI RMF and related crosswalk work.) (Source).

So what: Treat “one million token context” as a ceiling on what you can stage for a request, not as a strategy. Decide in advance which knowledge is allowed into model input, and which knowledge must be retrieved, cited, and logged on demand.

RAG still matters even with long context

RAG (Retrieval-Augmented Generation) fetches relevant documents first, then the model generates an answer grounded in those documents. With long-context designs, you may retrieve less often because more material can fit in the prompt. That reduction isn’t automatic, though. It depends on whether your documents are cleanly structured, whether user questions align with what you already included, and whether you can keep the system fresh and performant.

The risk-management logic doesn't change; only the mechanism does. Ask the model to answer from a large prompt that includes old policies or superseded guidance, and you can still get outputs that fail internal correctness requirements. NIST's AI RMF emphasizes mapping risks to governance activities such as measurement and monitoring, and communicating outcomes. For engineering, that means document versioning, evaluation against "current policy" datasets, and audit trails for what content was actually in-context at generation time. (Source).

Long context also introduces different failure modes. With RAG, you typically control the evidence set via retrieval filters. With prompt stuffing, you widen the evidence set unless you invest in aggressive prompt compaction and selection. OWASP's LLM-specific risk guidance for input handling and application security is a reminder that prompt composition isn't neutral: it's part of the attack surface and part of the risk profile. (Source; Source).

So what: Avoid treating RAG vs long-context as an either-or debate. Build a hybrid policy: use long-context for stable, governance-approved “working sets,” and use RAG (with citations and logs) for time-sensitive, user-specific, or compliance-sensitive queries.
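The hybrid policy above can be sketched as a routing rule. This is a minimal illustration, not a production router; the `Query` fields and the routing criteria are assumptions drawn from the article's own categories (time-sensitive, user-specific, compliance-sensitive).

```python
from dataclasses import dataclass

# Hypothetical routing policy: field names and criteria are illustrative.
@dataclass
class Query:
    text: str
    time_sensitive: bool        # asks about facts or policies in flux
    user_specific: bool         # depends on entitlements or tenant data
    compliance_sensitive: bool  # answer must carry citations and audit logs

def route(query: Query) -> str:
    """Send to RAG when evidence must be fresh, cited, and logged;
    otherwise answer from the governance-approved long-context working set."""
    if query.time_sensitive or query.user_specific or query.compliance_sensitive:
        return "rag"
    return "long_context"
```

A query like "What is our current expense policy?" would route to `"rag"`, while a stable glossary lookup could safely answer from the preloaded working set.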

Document governance becomes a prompt problem

When more text fits, teams often loosen governance because “the model can read it all.” That incentive is backwards. Governance must get stricter because you need to explain, after the fact, what the model saw, what was redacted, and which policy documents applied.

In practice, “prompt governance” has two audit questions your system must answer quickly, without re-running expensive retrieval or reconstructing prompts from logs that don’t exist:

  • What was eligible to enter the prompt?
  • What actually entered the prompt, and what changed between eligibility and inclusion?

NIST’s AI RMF frames governance and risk management as lifecycle activities, not a one-time checkbox. The ISO/IEC 23894 crosswalk to the NIST RMF (used for aligning risk practices) reinforces that risk handling should be systematic across phases, including information risk, transparency, and measurement. (Source). That’s exactly what prompt governance needs: a lifecycle record of which knowledge entered context and why.

Enterprise knowledge-work governance often includes data minimization, access control, redaction rules for sensitive fields, and auditability for content provenance. With one million token context, data minimization becomes “context minimization.” Access control becomes “who is allowed to have their authorized documents included in the prompt.” Redaction becomes “what gets removed before tokens are counted and cached.” Auditability becomes “what you log for each request,” including document identifiers and version hashes that prove lineage.

The missing detail most teams overlook is state. You're not just logging "documents"; you're logging transformations. A governance-compliant system should record:

  • Source identifiers (document ID, collection, effective date/version)
  • Selection rule (for example, “core memory allowlist,” “time-sensitive evidence,” or “user-tenant scoped”)
  • Redaction policy ID (the specific sanitizer template used)
  • Inclusion bounds (for example, character/token limits and truncation markers)
  • Cache key lineage (which cached artifacts were used, and under what invalidation rules)
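The record above can be captured as a small, hashable manifest entry. This is a sketch under stated assumptions: the field names mirror the bullets, but the schema and the hashing scheme are illustrative, not from any named framework.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

# Illustrative per-request prompt-manifest entry; field names are assumptions.
@dataclass
class ManifestEntry:
    doc_id: str
    version: str
    selection_rule: str       # e.g. "core_memory_allowlist"
    redaction_policy_id: str  # the specific sanitizer template used
    token_limit: int
    truncated: bool

def manifest_hash(entries: list[ManifestEntry]) -> str:
    """Stable hash over the manifest so audits can prove exactly which
    documents, versions, and redaction decisions entered context."""
    payload = json.dumps([asdict(e) for e in entries], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Because the hash changes whenever a document version or redaction policy changes, "what did the model see?" becomes an equality check against the logged hash rather than a prompt reconstruction exercise.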

Tool search adds another layer. If the model decides when to search internal tools, you need policy for what those tools can return and how results are logged. The EU’s regulatory framework for AI underlines that risk depends on system use and must be managed through appropriate obligations. Even without deep legal analysis, the engineering takeaway is straightforward: build logging and controls so the system can demonstrate responsible behavior. (Source).

So what: Make “prompt manifests” a required pipeline artifact. For every generation request, record document IDs, versions, redaction decisions, and retrieval rationale so audits don’t require reconstructing prompts from scratch.

Selective recall needs a better chunking model

Chunking breaks documents into pieces sized for efficient model use. Traditional chunking optimizes retrieval. With long-context windows, chunking must also optimize packing: include enough structure to preserve meaning, without pushing the prompt beyond practical limits for latency and cost.

A selective recall approach typically uses three layers:

  • A governance-approved “core memory” layer: stable knowledge such as approved internal standards, glossaries, and non-time-sensitive templates.
  • A retrieval “evidence layer”: RAG supplies the specific items that support the user’s question, with citations and logs.
  • A tool search layer: for live systems (tickets, dashboards, internal knowledge bases), the model requests narrow tool results rather than ingesting broad dumps.

OWASP's LLM application guidance emphasizes that prompt construction and external tool use can introduce security issues, including injection risks. That makes chunking about more than quality: it's about containment. Smaller, structured chunks make sanitization easier and help enforce allowed outputs from tools. (Source; Source).

Evaluation shifts too. If you pack more into context, your tests must include “context sensitivity” cases: the same question answered with slightly different prompt compositions shouldn’t degrade safety-critical outcomes. NIST’s AI RMF encourages risk measurement and monitoring, which in practice means building eval suites that vary input composition (including redacted vs non-redacted variants) and track regressions. (Source).

So what: Stop using one chunk size for everything. Use packing-aware chunking for your long-context core memory, and keep retrieval chunking for RAG evidence. Evaluate both dimensions because quality and risk can diverge.
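The two chunking profiles can be as simple as two size settings over one splitter. A minimal sketch, assuming fixed-size character chunking for brevity; real systems would split on document structure (headings, sections), and the profile sizes here are illustrative, not recommendations.

```python
def chunk(text: str, size: int, overlap: int = 0) -> list[str]:
    """Fixed-size chunking with optional overlap. Production chunkers split
    on structure, but the two-profile idea is the same."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Two profiles, per the "so what" above (sizes are assumptions):
PACKING_CHUNK = 8000    # fewer, larger chunks for the long-context core memory
RETRIEVAL_CHUNK = 1000  # smaller chunks so RAG retrieval filters stay precise
```

Evaluating both profiles separately matters because a chunk size that maximizes retrieval precision can fragment meaning when packed, and vice versa.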

Compaction, caching, and cost tradeoffs that hold

Prompt compaction reduces token usage while preserving intent and critical constraints. It can include summarizing, compressing repetitive instructions, removing irrelevant sections, and using structured formats instead of raw prose. A one million token window doesn’t remove the constraints on time, compute, and system throughput--so compaction still matters.

Caching turns big working sets into practical systems, but it can’t be treated as a performance hack. “Cache invalidation” isn’t a footnote; it’s an integrity control. To keep caching auditable and safe, treat cached prompt artifacts as versioned, permission-scoped assets with explicit invalidation triggers. Depending on your stack, cache at least one of the following:

  • Tokenized prompt segments (to skip repeated tokenization and formatting)
  • Sanitized or compacted document variants (post-redaction and post-truncation)
  • Structured manifests that expand into prompts (so selection logic stays deterministic)

Governance matters here. Caches must be invalidated on document version changes, policy updates, and access permission changes. Otherwise you can violate data governance even if your prompt manifest is correct at request time.
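One way to make those invalidation triggers structural rather than procedural is to bake them into the cache key itself. A sketch under stated assumptions: the inputs listed here mirror the triggers named above, but the key scheme is illustrative.

```python
import hashlib

def cache_key(doc_id: str, version_hash: str,
              redaction_policy_id: str, permission_scope: str) -> str:
    """Derive the cache key from every invalidation trigger: a new document
    version, a new redaction policy, or a permission change produces a new
    key, so a stale cached prompt artifact can never be looked up again."""
    raw = "|".join([doc_id, version_hash, redaction_policy_id, permission_scope])
    return hashlib.sha256(raw.encode()).hexdigest()
```

With this design, "invalidation" is mostly garbage collection of unreachable entries; correctness no longer depends on remembering to purge.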

Tool search intersects with cost. If the model can search tools, you can avoid stuffing large corpora into context. You pay the bill instead in tool latency and system calls. The best systems treat tool search as targeted instrumentation: the model proposes a search, the system executes with strict filters, and final generation uses tool outputs plus a small context window. That aligns with minimizing unnecessary exposure while keeping evidence current.

Reason about the trade space by separating token spend from call spend. Long-context designs convert more work into “prompt tokens in, tokens out.” RAG and tool designs convert more work into “retrieval and tool calls.” Either can be expensive, and the right choice depends on where your system bottlenecks. Operationally, define:

  • A prompt budget (max input tokens you allow into the model, including core memory and system instructions)
  • A call budget (max tool calls per request, plus a hard timeout)
  • An evidence budget (max number or size of evidence items allowed in-context)
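The three budgets above can be enforced as a single pre-flight check. A minimal sketch; the default values are illustrative placeholders, not recommendations, and should come from your own latency and audit data.

```python
from dataclasses import dataclass

# Default values are assumptions for illustration; tune from measurement.
@dataclass
class RequestBudgets:
    max_prompt_tokens: int = 200_000   # well under the 1M ceiling, leaving headroom
    max_tool_calls: int = 5
    tool_timeout_s: float = 10.0
    max_evidence_items: int = 20

def check_budgets(b: RequestBudgets, prompt_tokens: int,
                  tool_calls: int, evidence_items: int) -> list[str]:
    """Return the violated budgets; an empty list means the request may proceed."""
    violations = []
    if prompt_tokens > b.max_prompt_tokens:
        violations.append("prompt_budget")
    if tool_calls > b.max_tool_calls:
        violations.append("call_budget")
    if evidence_items > b.max_evidence_items:
        violations.append("evidence_budget")
    return violations
```

Rejecting or compacting a request before generation is what turns "budget" from a slide-deck word into a control.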

Tie these budgets to measurement. If prompt compaction and caching are working, you should see reduced p50/p95 latency at similar answer quality, plus stable compliance metrics (for example, zero “wrong version included” incidents in audits). Without measurement, “it feels faster” isn’t an engineering win.

OECD work on governing with AI stresses embedding governance across the AI system lifecycle. Even without legal advice, it maps cleanly to operational controls: change management for knowledge assets, monitoring for drift, and transparency about system behavior. (Source; Source).

So what: Engineer cost controls as first-class features: compaction for what you can cache, retrieval for what must remain fresh, and tool search for what should never be bulk-loaded.

What breaks as you approach the context limit

Engineers call it "diminishing returns": quality gains flatten as you add more context. With long-context windows, that flattening often comes with two practical failures: latency spikes and evidence dilution. Latency increases because the system must process more input. Evidence dilution happens when the model has to sift through too many competing statements and definitions.

OpenAI’s description of GPT-5.4 context capabilities is the starting point for what the model can accept, but it doesn’t mean your enterprise workflow should routinely pack near the limit. Plan for a safety margin so you can add tool outputs, disclaimers, and structured constraints without rebalancing the entire prompt.

Freshness vs context is another tension. Long context can preserve historical knowledge for continuity, but it can also trap you in outdated instructions if document versioning and retrieval triggers are weak. Your system must decide whether to answer “from memory” (core memory in-context) or “from evidence” (RAG/tool search). NIST AI RMF supports this decision because it treats risk management as lifecycle practice: you identify risks, implement controls, measure outcomes, and keep monitoring. (Source).

Tool search can also fail silently. If the model decides there's enough context and skips tool search, it can produce stale or incorrect "best-effort" answers that look well-formed. You need evaluation coverage for refusal, retrieval-required prompts, and contradiction handling: tests that catch when the model should have searched.

So what: Add “context budgeting” into your prompt pipeline. Set hard budgets for core memory size, enforce retrieval triggers for time-sensitive queries, and evaluate latency and contradiction rates as you approach your maximum practical context.

Quantitative guardrails you can operationalize now

You need numbers, not vibes. These are measurable levers and guardrails you can derive from validated sources.

  1. Context window ceiling: GPT-5.4 supports a one million token context window, meaning your request input cannot exceed that limit. It sets an absolute upper bound, even though it doesn’t specify the best operating point. (Source).

  2. LLM application security scope: OWASP’s Top 10 for LLM applications is explicitly structured into multiple risk categories, and the versioned document provides a practical checklist dimension for evaluation. That means you can map long-context behaviors (prompt composition, retrieval, tool search) to OWASP’s risk categories when building tests and controls. (Source; Source).

  3. Governance lifecycle framing: NIST AI RMF is documented as AI risk management framework material (AI RMF 1.0). The version matters because it’s guidance meant to be integrated into organizational processes and measurement plans, not treated as a general essay. (Source; Source).

Operationally, turn these sources into KPIs:

  • Latency per request vs prompt size (track prompt token count at generation time)
  • Citation fidelity or evidence match rate for RAG/tool outputs
  • Governance compliance rate for redaction and version validity (measured via prompt manifests)
  • Contradiction rate when retrieved evidence conflicts with in-context core memory
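The KPI list above reduces to a small aggregation over per-request logs. A sketch under stated assumptions: the record fields and the in-memory list are illustrative stand-ins for whatever logging store you actually use.

```python
# Illustrative per-request log rows; field names follow the KPI list above.
requests_log = [
    {"prompt_tokens": 120_000, "latency_ms": 900,  "version_valid": True,  "contradiction": False},
    {"prompt_tokens": 400_000, "latency_ms": 2400, "version_valid": True,  "contradiction": True},
    {"prompt_tokens": 80_000,  "latency_ms": 700,  "version_valid": False, "contradiction": False},
]

def kpis(rows: list[dict]) -> dict:
    """Aggregate the dashboard metrics named in the KPI list."""
    n = len(rows)
    latencies = sorted(r["latency_ms"] for r in rows)
    return {
        "p50_latency_ms": latencies[n // 2],
        "governance_compliance_rate": sum(r["version_valid"] for r in rows) / n,
        "contradiction_rate": sum(r["contradiction"] for r in rows) / n,
    }
```

Plotting latency against `prompt_tokens` from the same rows is what lets you pick a maximum practical context size empirically rather than by feel.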

So what: Build an evaluation dashboard that logs input token counts, retrieval decisions, and governance artifacts. Use it to choose your maximum practical context size and your retrieval thresholds.

Real-world cases and what they imply

Case evidence can be partial, but documented outcomes do exist. Four cases show how enterprises (and governments) made governance or architecture choices that reflect the same engineering reality: systems must be auditable, and risks must be managed.

NIST AI RMF drives U.S. adoption

The U.S. has publicly centered AI risk management around NIST AI RMF 1.0 as a widely used reference point for how organizations manage AI risks through a lifecycle. Direct implementation details vary by organization, but the framework’s role as a de facto governance template is documented by NIST. (Source). Outcome: organizations can build internal processes for measurement, monitoring, and communication rather than ad hoc controls.

Timeline: NIST AI RMF 1.0 is a published framework intended for organizational use, and subsequent NIST ITL materials keep the framework operational. (Source).

ISO IEC 23894 aligns via NIST crosswalk

NIST published a crosswalk aligning ISO/IEC 23894 with the NIST AI RMF. Outcome: organizations can map external risk taxonomy into internal NIST-aligned lifecycle activities, which matters directly for long-context prompt governance (what you record, how you measure, how you monitor). (Source).

Timeline: the crosswalk is dated 2025 in the document path and is meant for alignment work, not one-off reading. (Source).

EU expectations shape engineering controls

The European Union’s AI regulatory framework outlines how obligations depend on AI system classification and use context. Outcome for practitioners: design must be capable of demonstrating risk management and compliance readiness, which in engineering practice means logging, transparency, and controlled tool usage for systems affecting users and decisions. (Source).

Timeline: this framework is part of the EU’s published regulatory approach, guiding system design decisions for organizations operating in or targeting EU markets. (Source).

OWASP Top 10 guides eval and security testing

OWASP released a versioned "Top 10 for Large Language Model Applications" document (v2025). Outcome: teams can build evaluation and security tests around concrete categories, including issues tied to prompt injection and unsafe tool interactions, problems that become more likely when you pack large context and allow tool search. (Source; Source).

Timeline: the OWASP page and associated versioned PDF are current guidance for LLM application risk categories, intended to be used for practical engineering controls. (Source).

So what: Treat governance frameworks as engineering test plans. Your long-context architecture should follow what these frameworks expect you to control and demonstrate, not what feels convenient during prototyping.

Selective recall policy, plus a timeline

Policy isn’t only for regulators. It’s what your internal architecture must enforce.

Recommendation for enterprise practitioners: Require a “selective recall” pattern with three enforced controls:

  1. Context allowlist for core memory: only governance-approved document sets can be preloaded into one-million-token-capable inputs.
  2. Evidence trigger rules: for questions involving time-sensitive facts, policy changes, or user-specific entitlements, require RAG or tool search before answering.
  3. Audit-ready prompt manifests: every request logs document IDs, versions, redaction decisions, and tool outputs so you can reproduce what the model saw.
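The three controls above can be expressed as one enforcement gate. This is a minimal sketch; the allowlist contents, flag names, and manifest shape are all illustrative assumptions, not prescribed by any of the cited frameworks.

```python
# Control 1: only governance-approved documents may be preloaded.
# The allowlist contents are hypothetical examples.
CORE_MEMORY_ALLOWLIST = {"policy-handbook-v7", "glossary-v3", "templates-v2"}

def admit_to_core_memory(doc_id: str) -> bool:
    """Gate preloading on the governance-approved allowlist."""
    return doc_id in CORE_MEMORY_ALLOWLIST

# Control 2: these query traits force RAG or tool search before answering.
EVIDENCE_TRIGGERS = {"time_sensitive", "policy_change", "user_entitlement"}

def requires_evidence(query_flags: set) -> bool:
    """True when the query must be answered from fresh, cited evidence."""
    return bool(query_flags & EVIDENCE_TRIGGERS)

# Control 3: audit-ready record of everything the model saw.
def build_manifest(core_docs: list, evidence: list, tool_outputs: list) -> dict:
    """Log core memory, retrieved evidence, and tool outputs per request."""
    return {"core_docs": sorted(core_docs), "evidence": evidence, "tools": tool_outputs}
```

Each control is deliberately independent, so an audit can verify them one at a time: allowlist membership, trigger firing, and manifest completeness.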

These controls align with NIST AI RMF lifecycle thinking and with security-evaluation guidance from OWASP for LLM applications. (Source; Source).

Forecast (next 12 months, from March 2026): Expect enterprises to operationalize “long-context where it’s safe” rather than “long-context everywhere.” By late 2026, most knowledge-work deployments should converge on smaller, governance-approved core memories (to prevent evidence dilution and stale knowledge), retrieval/tool search triggers for freshness and personalization, and evaluation suites that measure contradiction and latency as context size increases.

The reason is practical: one million tokens is a capacity feature, not a governance automation feature. Framework-driven governance and application-security controls are what make long-context usable at scale. (Source; Source).

So what: If you’re implementing GPT-5.4-class long-context now, design for selective recall, and let evidence triggers plus audit artifacts and eval coverage do the heavy lifting.
