AI models like GPT-4o and Claude write fluent essays because they predict token sequences statistically, a task they are trained to do across billions of examples. They fail at counting letters and words because they never see letters or words directly. They see tokens: arbitrary subword fragments with no numeric or alphabetic structure the model can reliably reason over.
Pithy Cyborg | AI FAQs – The Details
Question: Why can AI write a perfect essay but fail at counting words or letters?
Asked by: ChatGPT
Answered by: Mike D (MrComputerScience) from Pithy Cyborg.
Why Tokens Are Not Words and Why That Breaks Counting Completely
Before any AI model reads a single character of your prompt, a tokenizer converts your text into a sequence of numeric IDs.
Those IDs do not correspond to words. They correspond to fragments.
The word “playing” might become two tokens: “play” and “ing.” The word “tokenization” might become three. The number 380 might be one token, while 381 is two tokens: “38” and “1.” The word “HELLO” can become three tokens: “HE,” “LL,” and “O.”
The model never sees letters. It sees numbers representing chunks.
This is why asking GPT-4o to count the letter R in “strawberry” produces a wrong answer with full confidence. The model is not counting characters. It is pattern-matching over token IDs that have no consistent relationship to the letters inside them.
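To make the mechanics concrete, here is a toy greedy longest-match tokenizer over a made-up vocabulary. Real tokenizers (BPE, WordPiece) learn their vocabularies from data and emit numeric IDs, so none of the fragments below come from an actual model; this is just a sketch of how surface-similar text ends up as structurally different chunks:

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary.
# Real tokenizers are learned from data; this only illustrates the idea.
VOCAB = {"straw", "berry", "play", "ing", "380", "38", "1"}

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        # Greedily take the longest vocabulary entry matching at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character passes through as-is
            i += 1
    return tokens

print(tokenize("strawberry"))            # ['straw', 'berry'] -- two opaque chunks
print(tokenize("380"), tokenize("381"))  # ['380'] vs ['38', '1']
```

The model receives the IDs for “straw” and “berry,” not the ten letters inside them, so “how many R’s?” has no direct answer in its input. Likewise 380 arrives as one chunk and 381 as two, with no encoded numerical relationship between them.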
The TechCrunch analysis of this problem described it precisely: tokenizers destroy the relationships between digits and letters by treating surface-similar text as structurally different. 380 and 381 look adjacent to a human. To the tokenizer, they are completely different objects with no numerical relationship encoded.
The Spelling Miracle That Makes AI’s Essay Skills Even Weirder
Here is the part that should genuinely surprise you.
Given that AI models never see individual letters, the fact that they can spell correctly at all is considered remarkable enough in the research community to have a name: the spelling miracle.
Models learn spelling not by understanding letters, but by absorbing statistical patterns across billions of token sequences. They learn that certain token combinations almost always appear together. They learn what correct output looks like from the outside, without having direct access to the character-level structure that makes spelling work for humans.
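A minimal sketch of that idea: a bigram model over token fragments “learns” correct spelling purely by memorizing which fragment tends to follow which, never touching a single letter. The tiny corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical training data: sequences of token fragments, not letters.
corpus = [["play", "ing"], ["play", "ing"], ["play", "ed"], ["straw", "berry"]]

# Count which fragment follows which -- pure co-occurrence statistics.
follows = defaultdict(Counter)
for seq in corpus:
    for a, b in zip(seq, seq[1:]):
        follows[a][b] += 1

# The most likely continuation of "play" is "ing", by frequency alone.
print(follows["play"].most_common(1))  # [('ing', 2)]
```

Scaled up by many orders of magnitude, statistics like these are enough to produce correctly spelled output from the outside, without any character-level access.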
This creates a bizarre asymmetry.
Ask Claude Sonnet 4.6 to write a grammatically flawless 800-word article. It will. Ask it to count how many words are in that article. It will probably be wrong.
The essay task is exactly what the model was trained to do: predict fluent token sequences. The counting task requires something the architecture does not natively support: an internal numeric counter that increments reliably as tokens are generated.
No such counter exists. The model approximates. And approximations fail at precise tasks.
When AI Counting Actually Works (And the Architecture Reason Why)
Reasoning models handle this better, and the reason is structural.
OpenAI’s o3 and Anthropic’s Claude thinking modes generate an explicit chain-of-thought before producing a final answer. That internal reasoning process gives the model something it normally lacks: a scratchpad where it can work through counting steps explicitly rather than pattern-matching to a likely answer in one pass.
This is why o3 can sometimes correctly identify the number of R’s in “strawberry” when standard GPT-4o fails. The chain-of-thought forces a step-by-step breakdown that partially compensates for the absence of character-level access.
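A rough Python analogue of what the scratchpad buys the model: spelling the word out one character at a time and ticking an explicit counter, rather than guessing the answer in a single pass. This is not how the model internally works, just an illustration of the step-by-step procedure the chain-of-thought approximates:

```python
def count_letter(word: str, letter: str) -> int:
    # Explicit enumeration: s-t-r-a-w-b-e-r-r-y, incrementing a counter
    # each time the target letter appears.
    count = 0
    for ch in word:
        if ch.lower() == letter.lower():
            count += 1
    return count

print(count_letter("strawberry", "r"))  # 3
```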
But it is not a fix. It is a workaround.
The root problem is that transformers scale quadratically with sequence length, which makes character-level tokenization (processing every letter individually) computationally expensive enough that no major model does it. Research into byte-level architectures like MambaByte shows promise but remains early-stage and not yet competitive with frontier models at scale.
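The cost gap is easy to see with back-of-the-envelope arithmetic. The figure of roughly four characters per subword token is an illustrative assumption, not a measured constant:

```python
# Self-attention cost grows with the square of sequence length.
chars = 4000                   # a ~4,000-character document
subword_tokens = chars // 4    # ~1,000 tokens, assuming ~4 chars/token
char_tokens = chars            # 4,000 tokens at character level

ratio = (char_tokens ** 2) / (subword_tokens ** 2)
print(ratio)  # 16.0 -- 4x the sequence length means ~16x the attention cost
```

That quadratic penalty is why subword tokenization persists despite everything it breaks.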
Until the architecture changes, counting is a fundamental weak point dressed up as a quirk.
What This Means For You
- Never trust an AI’s word count output without verifying in a separate tool, because the model generates text in tokens and has no reliable mechanism to count the words it just produced.
- Use reasoning models like o3 or Claude’s extended thinking mode for any task requiring precise counting, character identification, or letter-level analysis, since their chain-of-thought scratchpad partially compensates for the architecture gap.
- Expect AI to handle spelling and grammar well while failing at letter-counting specifically, because spelling rides on token-level statistics the model has in abundance, while letter-counting needs character-level access it lacks.
- Paste critical outputs into a word processor or use len() in Python to verify length constraints, rather than asking the model to count its own output: that just checks with the same broken tool that wrote it.
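The last bullet can be as simple as a two-line check that runs outside the model entirely. The sample sentence is illustrative:

```python
text = "The quick brown fox jumps over the lazy dog."

# Deterministic checks, no tokens involved: split on whitespace for
# a word count, str.count for letter occurrences.
print(len(text.split()))        # 9 words
print(text.lower().count("r"))  # 2 occurrences of "r"
```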