AI summarization models are trained to preserve high-frequency information and compress low-frequency information. The most important content in a document is frequently the least common: a single critical caveat, one exception to a general rule, the number that changes the conclusion. In these documents, importance and frequency are anti-correlated, yet summarization training rewards frequency, which is exactly why the part that matters most is the first thing cut.
Analysis Briefing
- Topic: Frequency-importance anti-correlation in LLM summarization
- Analyst: Mike D (@MrComputerScience)
- Context: A back-and-forth with Claude Sonnet 4.6 that went deeper than expected
- Source: Pithy Cyborg
- Key Question: Why does the one thing you needed to know disappear in the summary?
Why Summarization Models Learn to Cut Low-Frequency Information
Summarization models are trained on human-written summaries of documents. Those training summaries reflect what human summarizers chose to preserve. Human summarizers preserve the main argument, the general conclusion, and the most frequently repeated claims. They compress exceptions, caveats, and single-occurrence details.
This training signal teaches the model a reliable heuristic: preserve what appears often, compress what appears rarely. That heuristic works well on most documents most of the time. It fails specifically on documents where the most important information appears exactly once.
A contract where 47 clauses establish standard terms and one clause contains a critical liability exception produces a summary that covers the standard terms and omits the exception. The standard terms appeared throughout the document. The exception appeared once. The model’s training tells it that frequently appearing information is more important than rarely appearing information. For this document, that training signal is exactly backwards.
The Document Types Where This Failure Is Most Dangerous
Legal documents are the highest-risk category. Contracts, terms of service, and regulatory filings are structurally designed to establish general rules and then specify exceptions. The exceptions are frequently the entire point of a legal review. An AI summary that captures the general rule and omits the exception produces a legally incorrect characterization of the document.
Medical and clinical documents are the second highest-risk category. Clinical trial summaries, drug interaction guides, and patient records often contain a primary finding and one critical contraindication. The contraindication appears once. The primary finding appears repeatedly. A summarization model that compresses the contraindication because it appears rarely produces a summary that is medically dangerous.
Financial disclosures follow the same pattern. A prospectus that describes an investment opportunity across fifty pages and includes one paragraph of material risk disclosure produces summaries that emphasize the opportunity and compress the risk. The risk paragraph appeared once. The opportunity narrative ran for fifty pages.
How to Get Summaries That Preserve Critical Low-Frequency Information
Explicit instruction about what to preserve is the most reliable mitigation. Rather than asking for a general summary, specify the information type you need the summary to capture. “Summarize this contract and specifically preserve any exceptions, limitations, and liability clauses” instructs the model to override its default compression heuristic for that information category.
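The instruction pattern above can be sketched as a small prompt builder. This is a minimal illustration, not a specific vendor API: `llm` (not shown) stands for any text-generation callable, and the function and wording are assumptions for demonstration.

```python
def preserving_summary_prompt(document: str, categories: list[str]) -> str:
    """Build a summarization prompt that names the categories to keep,
    overriding the model's default frequency-based compression."""
    kept = ", ".join(categories)
    return (
        "Summarize the following document. "
        f"Specifically preserve any {kept}, even if each appears only once.\n\n"
        f"Document:\n{document}"
    )

# Example: a contract review where exceptions matter more than standard terms.
prompt = preserving_summary_prompt(
    "…contract text…",
    ["exceptions", "limitations", "liability clauses"],
)
```

The key design choice is the phrase "even if each appears only once": it directly names the frequency heuristic the model would otherwise apply.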
Extraction before summarization is more reliable than summary alone for high-stakes documents. Ask the model to extract all exceptions, caveats, conditions, and limitations as a separate step before summarizing. Extraction is a retrieval task rather than a compression task and does not apply the same frequency-based importance heuristic. The extracted exceptions can then be preserved explicitly in the summary.
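The two-step approach can be sketched as follows. Again, `llm` is a placeholder for any text-generation callable, and the prompt wording is an assumption, not a tested recipe; the point is the structure: a retrieval pass first, then a compression pass that is handed the retrieved items.

```python
# Step 1 is a retrieval task: list rare items verbatim, no compression.
EXTRACT_PROMPT = (
    "List every exception, caveat, condition, and limitation in the document "
    "below, verbatim, one per line. Output nothing else.\n\nDocument:\n{doc}"
)

# Step 2 is the compression task, with the extracted items explicitly pinned.
SUMMARIZE_PROMPT = (
    "Summarize the document below. Your summary must retain each of these "
    "extracted items:\n{items}\n\nDocument:\n{doc}"
)

def extract_then_summarize(llm, doc: str) -> str:
    """Extract low-frequency critical items first, then summarize with
    those items flagged for preservation."""
    items = llm(EXTRACT_PROMPT.format(doc=doc))
    return llm(SUMMARIZE_PROMPT.format(items=items, doc=doc))
```

Separating the calls matters: the extraction pass never has to trade the rare clause off against the frequent ones, so it cannot compress it away.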
Asking the model to identify what it compressed is an underused verification step. After receiving a summary, ask “what information from the original document did you not include in this summary?” The model’s answer surfaces the compressed content and allows you to evaluate whether the omissions include anything critical before acting on the summary.
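The verification question can be wrapped as a helper, using the same hedged `llm` placeholder; the prompt text is illustrative, not canonical.

```python
OMISSION_PROMPT = (
    "Here is a document and a summary of it.\n\n"
    "Document:\n{doc}\n\n"
    "Summary:\n{summary}\n\n"
    "What information from the document did you not include in the summary? "
    "List each omitted item."
)

def list_omissions(llm, doc: str, summary: str) -> str:
    """Ask the model to surface what its summary compressed away,
    so a human can check the omissions before acting on the summary."""
    return llm(OMISSION_PROMPT.format(doc=doc, summary=summary))
```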
What This Means For You
- Never use general summarization for legal, medical, or financial documents without specifying the information categories that must be preserved. Default summarization compresses low-frequency content, and critical information in these domains is typically low-frequency by design.
- Ask for extraction before summarization on high-stakes documents. Extract all exceptions, caveats, and conditions as a separate step, then summarize with those elements explicitly flagged for preservation.
- Ask what was omitted after every important summary. The model’s answer to “what did you not include?” surfaces compressed content that may contain the most important information in the document.
- Treat summary length as inversely correlated with exception preservation. The shorter the summary you request, the more aggressively the model applies its frequency heuristic. Longer summaries preserve more low-frequency content. For critical documents, request longer summaries than you think you need.
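The checklist above chains into one workflow: extract, summarize with the extracted items pinned and a longer length requested, then ask what was omitted. This sketch assumes the same generic `llm` callable; every prompt string is illustrative.

```python
def careful_summary(llm, doc: str) -> dict:
    """Extract → summarize (longer, with items pinned) → omission check.
    Returns both the summary and the model's list of omissions for review."""
    items = llm(
        "List every exception, caveat, condition, and limitation in the "
        "document below, verbatim, one per line.\n\nDocument:\n" + doc
    )
    summary = llm(
        "Summarize the document below in detail; a longer summary is "
        "acceptable. Your summary must retain each of these items:\n"
        + items + "\n\nDocument:\n" + doc
    )
    omitted = llm(
        "Document:\n" + doc + "\n\nSummary:\n" + summary
        + "\n\nWhat information from the document did the summary not "
        "include? List each omitted item."
    )
    return {"summary": summary, "omitted_for_review": omitted}
```

The output deliberately includes the omissions list: the point is not to trust the summary, but to hand a human the material needed to audit it.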
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
