<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Attention on Blackwell Systems</title><link>https://blog.blackwell-systems.com/tags/attention/</link><description>Recent content in Attention on Blackwell Systems</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 21 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.blackwell-systems.com/tags/attention/index.xml" rel="self" type="application/rss+xml"/><item><title>Why LLMs Struggle with JSON at Scale: A Tokenization Analysis</title><link>https://blog.blackwell-systems.com/posts/json-tokenization-structural-ambiguity/</link><pubDate>Sun, 21 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.blackwell-systems.com/posts/json-tokenization-structural-ambiguity/</guid><description>We benchmarked 8 tokenizers from 6 providers and ran exhaustive vocabulary scans. 15 of the most common JSON field names (including id, name, type, value, title, time, text, url, path, and description) exist as merged vocabulary entries, with the opening quote and the field name in a single token, on GPT-4 (#32586 for '&quot;name'), LLaMA, and Qwen. Claude and Gemma have zero such entries. These merges are hardcoded and deterministic, and they cannot be undone without retraining the tokenizer. On real eval data, the JSON boundary merge rate is 8.93% versus 1.00% for GCF (88.8% lower). GPT-4 has 114 quote+letter vocabulary entries versus 17 pipe+letter entries (a 6.7:1 ratio). JSON overhead is 81% at scale. This structural ambiguity compounds per row and explains model-dependent comprehension failures across 2,400+ evaluations.</description></item></channel></rss>