AI models can read PDFs the way a person reads a photocopy of a handwritten fax: the underlying content may be there, but the extraction quality depends entirely on how the PDF was created, how it is structured, and whether the text is machine-readable or embedded in an image. Most PDF processing failures are not model failures. They are extraction failures that the model never had a chance to overcome.
Analysis Briefing
- Topic: PDF processing failure modes and extraction quality in LLM pipelines
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg
- Key Question: Why does AI give wrong answers about a PDF I uploaded when the answer is right there on page three?
The Three Types of PDFs and Why They Fail Differently
Text-native PDFs are created by software that embeds the text layer directly into the file. Word documents exported to PDF, reports generated by code, and documents created in design software with selectable text all produce text-native PDFs. These extract cleanly. A text-native PDF sent to an LLM via the API or uploaded to claude.ai produces accurate comprehension because the model receives the actual text content.
Scanned PDFs are images of physical pages with no embedded text layer. A contract signed on paper and scanned, a book chapter photographed and compiled into a PDF, and any document created by scanning physical paper produce image-only PDFs. These require OCR to extract text before the model can process it. If your PDF processing pipeline does not run OCR on scanned PDFs before sending them to the model, the model receives image data rather than text data. Its comprehension of image-only PDFs then depends entirely on its vision capabilities, which are weaker than its text comprehension and degrade sharply on dense text, small fonts, and complex layouts.
Hybrid PDFs contain both machine-readable text layers and images, tables, figures, and other visual elements that are not captured in the text layer. A research paper with embedded charts, a financial report with formatted tables, and a form with filled-in handwriting alongside printed text all produce hybrid PDFs where the text layer captures some content and the visual elements capture other content. The model comprehends the text layer content accurately and may miss or misinterpret the visual content depending on its extraction method.
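Because the three types fail so differently, a pipeline should classify a PDF before deciding how to extract it. Here is a minimal, stdlib-only sketch of that classification: it scans the raw bytes for font resources (which text-native PDFs declare for their text layer) and embedded image markers. This is a heuristic, not a parser; compressed object streams can hide either marker, so a production pipeline should confirm with a real library such as PyMuPDF.

```python
def pdf_type_hint(pdf_bytes: bytes) -> str:
    """Rough classification of a PDF from its raw bytes (heuristic only).

    Text-native PDFs declare font resources (/Font) for their text layer;
    image-only scans embed page images (/Image, often DCTDecode-compressed
    JPEGs) without any fonts. A file with both is likely hybrid.
    """
    has_fonts = b"/Font" in pdf_bytes
    has_images = b"/Image" in pdf_bytes or b"/DCTDecode" in pdf_bytes
    if has_fonts and has_images:
        return "hybrid"       # text layer plus embedded visuals
    if has_fonts:
        return "text-native"  # extract the embedded text layer directly
    if has_images:
        return "scanned"      # route to OCR before the model sees it
    return "unknown"          # markers hidden or file is malformed
```

In practice you would run this (or the library-backed equivalent) as the first step of ingestion and branch the rest of the pipeline on the result.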
Why Multi-Column Layouts, Tables, and Footnotes Break Extraction
PDF text extraction reads characters in the order they are stored in the file, not in the visual reading order a human would follow. In a single-column document, storage order and reading order are usually the same. In a multi-column document, they are not.
A two-column academic paper stores text in one of two ways: column by column, which produces readable extraction, or row by row across both columns, which produces interleaved text from both columns simultaneously. Row-by-row storage is common in PDFs created from word processors that typeset in columns. The extracted text alternates between the first column and the second column sentence by sentence, producing incoherent input for the model even though the original PDF looks perfectly formatted.
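The interleaving is easy to reproduce. The sketch below simulates four text fragments on a two-column page, each with an (x, y) position, and compares naive row-major extraction against column-aware extraction. The positions and the 150-point column split are invented for illustration.

```python
# Each tuple is a text fragment with its (x, y) position on the page;
# y grows downward, as in PDF page coordinates after normalization.
fragments = [
    ("Column one, line one.", 0, 0),
    ("Column two, line one.", 300, 0),
    ("Column one, line two.", 0, 20),
    ("Column two, line two.", 300, 20),
]

# Naive extraction: read fragments in positional (row-major) order.
# The two columns interleave line by line -- incoherent model input.
row_major = " ".join(
    t for t, x, y in sorted(fragments, key=lambda f: (f[2], f[1]))
)

# Layout-aware extraction: split by column first (x < 150 vs x >= 150),
# then read each column top to bottom, recovering human reading order.
by_y = sorted(fragments, key=lambda f: f[2])
left = " ".join(t for t, x, y in by_y if x < 150)
right = " ".join(t for t, x, y in by_y if x >= 150)
column_major = left + " " + right
```

Real extractors face the harder problem of detecting where the column boundary is, but the failure mode is exactly this interleave.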
Tables are a similar problem. PDF tables are stored as positioned text elements without structural metadata indicating which cell each element belongs to. Extraction reads the positioned text in order and produces a flat text representation that loses the table structure. A financial table with numbers in cells becomes a stream of numbers with no indication of which row and column each number occupied.
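The only recoverable table structure is positional. A sketch of the standard recovery trick: cells whose y coordinates fall within a tolerance belong to the same row, and sorting each row by x restores column order. This is roughly what tools like pdfplumber do, minus their analysis of ruling lines; the tolerance value here is an illustrative guess.

```python
def rebuild_rows(cells, y_tolerance=2.0):
    """Group positioned text cells (text, x, y) into table rows.

    PDFs store table cells as free-floating positioned text with no
    cell metadata, so rows must be inferred: cells within y_tolerance
    of each other share a row; x position orders the columns.
    """
    rows = []
    for text, x, y in sorted(cells, key=lambda c: (c[2], c[1])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))   # same row: collect the cell
        else:
            rows.append((y, [(x, text)]))   # new row starts here
    # Sort each row's cells left-to-right and drop the coordinates.
    return [[t for _, t in sorted(r)] for _, r in rows]
```

Feeding the model `Revenue | 2024` and `Cost | 900` as aligned rows is a very different input from the flat stream `Revenue 2024 Cost 900` with no row boundaries.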
Footnotes, headers, and watermarks add additional text to the extraction that the model must filter or interpret. A document with page headers that repeat the chapter title on every page produces a text extraction where the chapter title appears dozens of times interspersed with the actual content. The model processes this repeated text as context, which can bias its responses toward the repeated elements.
The Preprocessing Steps That Actually Fix PDF Comprehension
OCR as a preprocessing step converts scanned PDFs to text-native format before sending them to the model. Tesseract, AWS Textract, and Google Document AI all provide OCR that is significantly more accurate than vision-model text reading for dense document content. Running OCR on all PDFs before model processing eliminates the scanned-PDF failure mode entirely.
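The routing logic is simple enough to sketch independently of any particular OCR backend. Here the three backend operations are injected as callables, so the same routing works whether `extract_layer` wraps PyMuPDF and `run_ocr` wraps Tesseract, Textract, or Document AI; all three names and signatures are hypothetical placeholders.

```python
def extract_text(pdf_bytes, has_text_layer, extract_layer, run_ocr):
    """Route a PDF down the correct extraction path before the model.

    has_text_layer, extract_layer, and run_ocr are injected callables
    (hypothetical signatures) wrapping whatever backend you use --
    e.g. PyMuPDF for the text layer, Tesseract for OCR.
    """
    if has_text_layer(pdf_bytes):
        return extract_layer(pdf_bytes)  # text-native: use the embedded layer
    return run_ocr(pdf_bytes)            # scanned: OCR before the model
```

The point of the indirection is that the failure mode lives in the routing: pipelines that skip the `has_text_layer` check and send everything down one path are the ones that feed image data to a text model.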
Layout-aware extraction tools that understand PDF structure rather than just reading character positions produce significantly better text from multi-column layouts and tables. PyMuPDF with layout analysis, pdfplumber for table extraction, and Adobe’s PDF Extract API all provide structure-aware extraction that produces readable text from complex PDF layouts. The extracted text still requires postprocessing but is substantially better than character-position-order extraction.
Chunking at structural boundaries rather than character count boundaries prevents the model from receiving chunks that split mid-sentence across columns or mid-row across table cells. Structural extraction that identifies paragraph, section, and table boundaries and preserves them in the chunking step produces retrieval chunks that contain coherent semantic units rather than arbitrary character windows.
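A minimal version of boundary-respecting chunking, assuming paragraphs are already separated by blank lines (which is what structure-aware extraction gives you): pack whole paragraphs into chunks up to a size budget, never cutting inside one. The 1,000-character budget is an arbitrary illustration.

```python
def chunk_paragraphs(text, max_chars=1000):
    """Chunk at paragraph boundaries instead of raw character offsets.

    Blank-line-separated paragraphs are packed whole into chunks of up
    to max_chars, so no chunk ever splits mid-sentence. A single
    paragraph longer than max_chars becomes its own oversized chunk
    rather than being cut arbitrarily.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)        # budget exceeded: close the chunk
            current = p
        else:
            current = current + "\n\n" + p if current else p
    if current:
        chunks.append(current)
    return chunks
```

The same packing idea extends to sections and tables once the extractor labels those boundaries: treat each structural unit as indivisible and pack units, not characters.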
What This Means For You
- Detect PDF type before processing. Check whether your PDFs contain a text layer using a library like PyMuPDF. Scanned PDFs with no text layer need OCR before model processing. Sending image-only PDFs directly to an LLM produces poor comprehension regardless of model capability.
- Use layout-aware extraction for complex documents. Multi-column layouts, tables, and mixed-content PDFs require structure-aware extraction tools rather than basic text extraction. PyMuPDF, pdfplumber, and similar libraries produce dramatically better input than naive PDF-to-text conversion.
- Chunk at structural boundaries, not character count boundaries. PDF content should be chunked at paragraph, section, and table boundaries. Character-count chunking splits coherent content arbitrarily and degrades both retrieval precision and model comprehension.
- Verify extraction quality before assuming model failure. When an AI gives wrong answers about a PDF, extract the text yourself and read it before blaming the model. If the extracted text is incoherent, the problem is extraction. If the extracted text is correct, the problem may be the model.
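Reading the extracted text yourself is the gold standard, but a cheap automated screen can flag the worst cases at scale. The heuristic below measures the share of ordinary word-like tokens; flattened tables (streams of bare numbers) and OCR garbage push it down, though it will not catch reading-order problems like interleaved columns, which still produce valid words. The 0.7 threshold is an illustrative guess to calibrate on your own documents.

```python
import re

def looks_coherent(text, min_word_ratio=0.7):
    """Cheap sanity screen on extracted text before blaming the model.

    Counts tokens that look like ordinary words (letters, optionally
    trailed by punctuation). Number streams from flattened tables and
    OCR noise drive the ratio down; interleaved columns do not, and
    still need a human read.
    """
    tokens = text.split()
    if not tokens:
        return False  # empty extraction is its own failure signal
    wordlike = sum(
        1 for t in tokens if re.fullmatch(r"[A-Za-z][A-Za-z'\-.,;:]*", t)
    )
    return wordlike / len(tokens) >= min_word_ratio
```

A failing check means the problem is upstream of the model; a passing check means it is worth reading the extraction yourself before drawing conclusions.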
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
