Research
Why LLMs Struggle with JSON at Scale: A Tokenization Analysis
We benchmarked 8 tokenizers from 6 providers and performed exhaustive vocabulary scans. 15 of the most common JSON field names (id, name, type, value, title, time, text, url, path, description) exist as merged vocabulary entries (quote+field in one token) on GPT-4 (#32586 for ‘“name’), LLaMA, and Qwen. Claude and Gemma have zero such entries. These merges are hardcoded, deterministic, and irrecoverable without retraining the tokenizer. On real eval data: JSON boundary merge rate 8.93% vs GCF 1.00% (88.8% fewer). GPT-4 has 114 quote+letter vocabulary entries vs 17 pipe+letter (6.7:1 ratio). JSON overhead is 81% at scale. This structural ambiguity compounds per row and explains model-dependent comprehension failures across 2,400+ evaluations.