Attention on Blackwell Systems

Attention on Blackwell Systemshttps://blog.blackwell-systems.com/tags/attention/Recent content in Attention on Blackwell SystemsHugoen-usSun, 21 Jun 2026 00:00:00 +0000Why LLMs Struggle with JSON at Scale: A Tokenization Analysishttps://blog.blackwell-systems.com/posts/json-tokenization-structural-ambiguity/Sun, 21 Jun 2026 00:00:00 +0000https://blog.blackwell-systems.com/posts/json-tokenization-structural-ambiguity/We benchmarked 8 tokenizers from 6 providers and performed exhaustive vocabulary scans. 15 of the most common JSON field names (id, name, type, value, title, time, text, url, path, description) exist as merged vocabulary entries (quote+field in one token) on GPT-4 (#32586 for ‘“name’), LLaMA, and Qwen. Claude and Gemma have zero such entries. These merges are hardcoded, deterministic, and irrecoverable without retraining the tokenizer. On real eval data: JSON boundary merge rate 8.93% vs GCF 1.00% (88.8% fewer). GPT-4 has 114 quote+letter vocabulary entries vs 17 pipe+letter (6.7:1 ratio). JSON overhead is 81% at scale. This structural ambiguity compounds per row and explains model-dependent comprehension failures across 2,400+ evaluations.