LLMs do not read words or numbers. They read tokens, compressed chunks of text that rarely align with human intuitions about letters, syllables, or digits. This single architectural fact explains why GPT-4o can draft a legal brief but struggles to count the letter R in “strawberry,” why Claude miscalculates digit-by-digit arithmetic, and why every model you have ever used has a hidden list of inputs that silently corrupt its reasoning before a single layer of the neural network fires.
Pithy Cyborg | AI FAQs – The Details
Question: Why does tokenization cause LLMs to fail at spelling, character counting, and arithmetic, and are there inputs that silently break reasoning in production models like GPT-4o and Claude?
Asked by: Gemini 2.0 Flash
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
How Tokenizers Silently Mangle Input Before the Model Sees Anything
Tokenization happens before reasoning. By the time a prompt reaches the attention layers of GPT-4o or Claude Sonnet 4.6, it has already been converted into a sequence of integer IDs corresponding to subword chunks in the model’s vocabulary. The model never sees your original text. It sees a lossy compressed representation of it.
OpenAI’s tiktoken tokenizer splits “strawberry” into subword chunks rather than letters, typically something like “straw” and “berry” (the exact split depends on the vocabulary). The model has no direct access to the individual letters. When you ask it to count the R’s, it is not scanning characters. It is doing something closer to inferring the answer from statistical patterns about how that token sequence behaves in its training data. Sometimes it gets it right. When it gets it wrong, it is not making a careless mistake. It is operating on a representation that structurally does not contain the information you assumed it did.
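The information loss is easy to see with a toy BPE-style tokenizer. The vocabulary and token IDs below are invented for illustration (they are not tiktoken’s actual merges), but the mechanics are the same: once text becomes integer IDs, letter counts are only recoverable by consulting the vocabulary, and the model’s forward pass never does that.

```python
# Toy greedy longest-match subword tokenizer with a hypothetical vocabulary.
# Real BPE merges differ; the point is what survives the text -> ID mapping.
VOCAB = {"straw": 101, "berry": 202, "s": 1, "t": 2, "r": 3, "a": 4, "w": 5,
         "b": 6, "e": 7, "y": 8}

def tokenize(text):
    """Greedy longest-match: always take the longest vocab entry that fits."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):          # try longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

ids = tokenize("strawberry")
print(ids)    # [101, 202] -- two opaque integers, no letters

# Counting the letter 'r' from the IDs alone is impossible; we must
# de-tokenize first, which is exactly the step the model cannot perform.
text = "".join(k for tid in ids for k, v in VOCAB.items() if v == tid)
print(text.count("r"))    # 3 -- correct only because we recovered the chars
```

The two printed lines make the asymmetry concrete: the ID sequence is what the model reasons over, and the character count lives only on the de-tokenized side.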
Numbers are worse. The number 9472 might tokenize as a single token, or as “94” and “72,” or as four separate digit tokens, depending on the tokenizer and context. Arithmetic requires the model to reason about digit positions, but the tokenization boundary and the digit boundary are almost never aligned. The model is solving math on a number line that keeps shifting under its feet.
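The misalignment can be sketched with three hypothetical number vocabularies (the IDs are invented): the same number lands on different token boundaries under each one, so positional logic like carrying learned under one split does not transfer to another.

```python
# Three hypothetical vocabularies; real tokenizers vary in the same way.
MERGED = {"9472": 900}                      # whole number is one token
PAIRED = {"94": 301, "72": 302}             # two-digit chunks
DIGITS = {d: int(d) for d in "0123456789"}  # character-level fallback

def tokenize(text, vocab):
    """Greedy longest-match against the given vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("9472", MERGED))   # [900]        -- one token, no digit info
print(tokenize("9472", PAIRED))   # [301, 302]   -- boundary at 94|72
print(tokenize("9472", DIGITS))   # [9, 4, 7, 2] -- aligned, but uncommon

# The tens digit of 9472 sits in a different token under each scheme, so
# carry logic learned for one split fails silently under another.
```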
The Tokenization Artifacts That Break Production Reasoning Silently
The failure is not always visible. That is what makes tokenization artifacts operationally dangerous in production deployments.
In 2023, researchers discovered that certain token sequences caused GPT models to produce incoherent or looping outputs. The most documented case involved the token “SolidGoldMagikarp,” a Reddit username that appeared frequently enough in the tokenizer’s training corpus to receive its own token ID, but rarely or never appeared in the corpus the model itself was trained on. Prompting the model with that token produced undefined behavior: evasion, topic changes, and occasionally outputs that bypassed safety filters. This class of problem is called a glitch token, and every tokenizer-based model has them.
The practical production risk is subtler than glitch tokens. Long strings of repeated characters, certain Unicode sequences, mixed-script text, and uncommon punctuation combinations can all produce tokenization patterns the model was undertrained on. The output does not crash. It degrades. Reasoning quality drops in ways that are hard to catch without specifically testing for them, because the response is still fluent and confident.
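A cheap mitigation is a pre-flight check that flags the input patterns named above before they reach the model. The sketch below uses only the standard library; the thresholds and the Unicode-name heuristic for detecting mixed scripts are illustrative assumptions, not calibrated values.

```python
import re
import unicodedata

def tokenization_risk_flags(text, run_len=20, punct_ratio=0.3):
    """Heuristic flags for inputs likely to tokenize into undertrained
    sequences. Thresholds are illustrative, not calibrated."""
    flags = []
    # Long runs of one repeated character (e.g. "aaaa...").
    if re.search(r"(.)\1{%d,}" % (run_len - 1), text):
        flags.append("long repeated-character run")
    # Approximate each letter's script by the first word of its Unicode name.
    scripts = {unicodedata.name(ch, "?").split()[0]
               for ch in text if ch.isalpha()}
    if len(scripts) > 1:
        flags.append(f"mixed scripts: {sorted(scripts)}")
    # Unusually punctuation-heavy input.
    punct = sum(1 for ch in text if unicodedata.category(ch).startswith("P"))
    if text and punct / len(text) > punct_ratio:
        flags.append("high punctuation density")
    return flags

print(tokenization_risk_flags("a" * 40))               # repeated run
print(tokenization_risk_flags("hello мир"))            # Latin + Cyrillic
print(tokenization_risk_flags("normal English text.")) # []
```

A flagged input is not guaranteed to fail; the check just marks prompts worth routing to extra evaluation before they ship.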
RAG pipelines are particularly exposed. When retrieved document chunks get concatenated into a prompt, the tokenization boundary between chunks can create artifact sequences the model mishandles, producing summaries that subtly misrepresent the source material with no visible error signal.
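One simple hedge is to join retrieved chunks with a stable, whitespace-padded delimiter instead of bare concatenation, so the tail of one chunk cannot fuse with the head of the next into an unfamiliar token sequence. The delimiter string below is an assumption; any consistent separator the model sees often in training serves the same purpose.

```python
def join_chunks(chunks, delimiter="\n\n---\n\n"):
    """Concatenate retrieved chunks with a padded delimiter so the end of
    one chunk cannot merge with the start of the next at tokenization time.
    Stripping avoids doubled or missing whitespace at each seam."""
    return delimiter.join(chunk.strip() for chunk in chunks)

chunks = ["Revenue grew 12% in Q3.", "Churn rose 4% over the same period."]
print(join_chunks(chunks))
```

Newline-padded separators tend to tokenize as their own stable tokens, which keeps each chunk's boundary tokens the same as they were in the source document.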
When Tokenization Actually Works in Your Favor (And How to Route Around It When It Doesn’t)
Tokenization is not all liability. For natural prose in high-resource languages like English, French, and German, modern tokenizers like tiktoken and SentencePiece are extremely well-optimized. The subword chunks align well with morphological units, context is preserved, and the model’s training distribution covers the vast majority of inputs a typical user generates.
The failure modes cluster predictably. Character-level tasks (spelling, counting, anagrams), digit-level arithmetic without a code interpreter, uncommon proper nouns that straddle token boundaries, low-resource languages where the tokenizer uses character-level fallbacks, and structured data formats like CSV or JSON with irregular spacing all carry elevated artifact risk.
The practical routing strategy: if your use case requires character-level or digit-level precision, do not trust the base model. Use the code interpreter. GPT-4o’s code execution environment and Claude’s tool use both bypass tokenization artifacts for arithmetic and string manipulation by running actual Python rather than inferring the answer from token statistics. That is not a workaround. That is the architecturally correct solution for tasks the token representation was never designed to handle.
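The routing rule can be sketched as a small dispatcher: detect character-precision requests and answer them with real string operations, and delegate everything else to the model. The keyword patterns and the `call_llm` stub are hypothetical placeholders, not any vendor's API.

```python
import re

def call_llm(prompt):
    """Stand-in for a real model call (hypothetical)."""
    return "<model answer>"

# Illustrative patterns for tasks needing character-level precision.
PRECISION_PATTERNS = [
    r"\bcount\b.*\bletter",      # "count the letter r in ..."
    r"\bhow many times\b",
    r"\breverse\b.*\bstring\b",
]

def route(prompt):
    """Send character-precision tasks to code, everything else to the model."""
    if any(re.search(p, prompt, re.IGNORECASE) for p in PRECISION_PATTERNS):
        m = re.search(r"letter (\w).*?[\"“'](\w+)[\"”']", prompt)
        if m:
            letter, word = m.groups()
            return str(word.count(letter))   # exact, character-level answer
    return call_llm(prompt)

print(route('Count the letter r in "strawberry"'))  # computed by Python
print(route("Summarize the attached report"))       # delegated to the model
```

In production the code path would be a sandboxed interpreter call rather than an inline regex, but the architecture is the same: precision tasks never touch token statistics.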
What This Means For You
- Never trust LLM character counts, letter frequencies, or digit-by-digit arithmetic without routing through a code interpreter, because the tokenizer has already discarded the character-level information the model would need to answer correctly.
- Test your production prompts for tokenization edge cases by running them through a tokenizer visualizer like Tiktokenizer before deployment, especially if your inputs include unusual punctuation, mixed scripts, or concatenated structured data.
- Treat degraded reasoning on uncommon proper nouns and technical strings as a tokenization signal, not a knowledge gap, and consider splitting or rephrasing inputs that straddle likely token boundaries awkwardly.
- Audit any RAG pipeline where retrieved chunks are concatenated directly into prompts, checking that chunk boundaries do not produce artifact token sequences that could silently distort the model’s summarization of source documents.
