Why LLMs Struggle with JSON at Scale: A Tokenization Analysis

JSON's structural grammar tokenizes ambiguously across models. Merged tokens such as "name (#32586 on GPT-4) are hardcoded vocabulary entries on GPT-4, LLaMA, and Qwen. This cannot be fixed after training: the vocabulary is frozen and every model weight depends on it. GCF's pipe delimiter has almost no such vocabulary merges. Based on an exhaustive vocabulary scan of 8 tokenizers from 7 providers.

Everyone knows JSON is verbose. The common explanation for why LLMs struggle with JSON at scale is “too many tokens.” That explanation is incomplete. The real problem is more subtle and more dangerous: JSON’s structural grammar tokenizes ambiguously across different models, and this ambiguity compounds with every row of data.

We ran a structural variance benchmark across 8 tokenizers from 7 providers. The findings explain a pattern we’ve observed across 2,400+ LLM evaluations: why JSON comprehension fails at scale, why it fails differently per model, and why no amount of prompt engineering can fix it.

Background: How BPE Tokenizers Handle Structured Data

Modern LLMs use Byte-Pair Encoding (BPE) tokenizers trained on large text corpora. BPE builds a vocabulary by iteratively merging the most frequent byte sequences. This creates a vocabulary optimized for natural language, not for structured data formats.

The critical property: BPE merging is context-dependent. The same character can be part of different tokens depending on what characters surround it. A quote character " might be its own token in one context and merge with adjacent characters in another.

For natural language, this is efficient (common words like “the” become single tokens). For structured formats like JSON, it creates a problem: the characters that mark structural boundaries (", :, {, }) can merge with the content they’re supposed to delimit.

A critical distinction: Any structured format contains two types of content: grammar symbols (delimiters that define structure) and payload content (the actual data values). A format designer controls grammar symbols but cannot control how payload content tokenizes without altering the data itself. The question isn’t “does everything tokenize consistently?” (it won’t, and can’t). The question is: do the structural boundaries always land at clean, unambiguous token positions? If yes, the model always knows where one field ends and the next begins, regardless of how the values themselves split.

This has been noted in passing by researchers. Deekeswar (2604.17512) measured that 1,000 JSON records consume ~80K tokens, with the majority being repeated keys and punctuation. Karim and Batatia (2508.01685) explored structured tokenization for LLM training data. To our knowledge, however, no one has performed a systematic mechanistic analysis of exactly how and where JSON’s structure breaks down at the BPE level.

The Experiment

We tested 8 tokenizers from 7 providers, representing every major LLM family in production:

| Tokenizer | Provider | Model family | Vocab size |
| --- | --- | --- | --- |
| Claude tokenizer | Anthropic | Claude 3.5, 4.x | ~100K |
| cl100k_base | OpenAI | GPT-4 | 100,256 |
| o200k_base | OpenAI | GPT-4o | 200,019 |
| LLaMA 3.1 tokenizer | Meta | LLaMA 3.x | 128,256 |
| Qwen 2.5 tokenizer | Alibaba | Qwen 2.5 | 151,936 |
| DeepSeek V3 tokenizer | DeepSeek | DeepSeek V3 | 128,000 |
| Gemma 2 tokenizer | Google | Gemma 2 | 256,128 |
| Mistral Nemo tokenizer | Mistral | Mistral/Ministral | 131,072 |

These tokenizers were each trained on different corpora with different merge priorities. Their disagreements on how to tokenize the same input reveal fundamental properties of that input’s structure.

We measured two things:

  1. Do structural delimiters tokenize consistently across models? If not, different models see different field boundaries for the same data.
  2. Do structural characters merge with adjacent content? If so, the model receives tokens that conflate structural markup with semantic content, making field boundaries ambiguous.
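The second measurement can be expressed as a simple check over a token sequence. The sketch below assumes the tokens are already available as strings (for example, decoded from any tokenizer library) and flags tokens that mix a structural character with alphanumeric content. The function name and the delimiter set are illustrative and are not taken from the benchmark scripts; the two sample sequences are the "value":"hello" tokenizations reported later in this analysis.

```python
# Illustrative helper (hypothetical name): flag tokens that fuse a structural
# delimiter with alphanumeric field-name or payload characters.
JSON_DELIMITERS = set('"{}[]:,')

def fused_tokens(tokens, delimiters):
    """Return the tokens that contain both a delimiter and an alphanumeric character."""
    return [
        tok for tok in tokens
        if any(ch in delimiters for ch in tok) and any(ch.isalnum() for ch in tok)
    ]

# Token sequences as reported in this article for "value":"hello"
merged = ['"value', '":"', 'hello', '"']          # GPT-4, GPT-4o, LLaMA, Qwen
separate = ['"', 'value', '":"', 'hello', '"']    # Claude, DeepSeek, Gemma, Mistral

print(fused_tokens(merged, JSON_DELIMITERS))    # ['"value']
print(fused_tokens(separate, JSON_DELIMITERS))  # []
```

A token such as ":" made only of structural characters is not flagged, because it does not conflate markup with content.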

Finding 1: JSON Field Boundaries Tokenize Inconsistently

JSON uses the pattern "fieldName": to mark each field. This pattern repeats on every row of an array. We tested 155 common field names from production APIs across all 8 tokenizers.

15 of the most common field names in computing merge on half or more of all tokenizers:

| Field | Merge rate | Models affected |
| --- | --- | --- |
| "id": | 63% (5/8) | GPT-4, GPT-4o, LLaMA, Qwen, Mistral |
| "name": | 63% (5/8) | GPT-4, GPT-4o, LLaMA, Qwen, Mistral |
| "time": | 63% (5/8) | GPT-4, GPT-4o, LLaMA, Qwen, Mistral |
| "title": | 63% (5/8) | GPT-4, GPT-4o, LLaMA, Qwen, Mistral |
| "type": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "value": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "url": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "user_id": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "text": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "path": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "description": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "in": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "is": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "encoding": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |
| "dns": | 50% (4/8) | GPT-4, GPT-4o, LLaMA, Qwen |

These aren’t obscure fields. id, name, type, value, title, time, text, url, path, description appear in virtually every JSON API response. The affected model families (GPT-4/4o, LLaMA, Qwen) represent roughly half the LLM market.

What this means: At 500 rows with just id + name + type (and what payload doesn’t have these?), that’s 1,500 field boundaries hidden inside a merged token on half or more of the tested tokenizers. Claude, DeepSeek, and Gemma keep these boundaries clean. GPT-4, GPT-4o, LLaMA, and Qwen merge all three fields; Mistral merges id and name. This isn’t a one-time ambiguity; it compounds linearly with data size.

The worst case: 7 distinct tokenizations

We searched across 40 field names and 21 values to find maximum variance. The worst pattern:

"userName":"req_xyz789" produces 7 distinct tokenizations across 8 models:

GPT-4, LLaMA:     ["][userName][":"][req][_xyz][789]["]
GPT-4o:           ["user][Name][":"][req][_xyz][789]["]
Claude:           ["][userName][":"][req][_][xyz][789]["]
Qwen 2.5:        ["][userName][":"][req][_xyz][7][8][9]["]
DeepSeek V3:     ["][user][Name][":"][req][_][xyz][789]["]
Gemma 2:         ["][userName][":"][req][_][xyz][7][8][9]["]
Mistral Nemo:    ["][user][Name][":"][req][_x][yz][7][8][9]["]

Almost every model sees a structurally different token sequence for the same data. Note how GPT-4o merges the quote into ["user] while other models keep it separate.
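The count of distinct tokenizations can be checked directly from the listing above. The sketch below copies the token sequences as reported in this article (it does not call any tokenizer), confirms that each one reassembles to the original string, and counts the unique sequences.

```python
# Token sequences copied from the listing above; no tokenizer is invoked here.
TARGET = '"userName":"req_xyz789"'

tokenizations = {
    "GPT-4":        ['"', 'userName', '":"', 'req', '_xyz', '789', '"'],
    "LLaMA":        ['"', 'userName', '":"', 'req', '_xyz', '789', '"'],
    "GPT-4o":       ['"user', 'Name', '":"', 'req', '_xyz', '789', '"'],
    "Claude":       ['"', 'userName', '":"', 'req', '_', 'xyz', '789', '"'],
    "Qwen 2.5":     ['"', 'userName', '":"', 'req', '_xyz', '7', '8', '9', '"'],
    "DeepSeek V3":  ['"', 'user', 'Name', '":"', 'req', '_', 'xyz', '789', '"'],
    "Gemma 2":      ['"', 'userName', '":"', 'req', '_', 'xyz', '7', '8', '9', '"'],
    "Mistral Nemo": ['"', 'user', 'Name', '":"', 'req', '_x', 'yz', '7', '8', '9', '"'],
}

# Sanity check: every sequence must concatenate back to the same input.
for model, tokens in tokenizations.items():
    assert "".join(tokens) == TARGET, model

distinct = {tuple(tokens) for tokens in tokenizations.values()}
lengths = sorted({len(tokens) for tokens in tokenizations.values()})

print(len(distinct))  # 7
print(lengths)        # [7, 8, 9, 10, 11]
```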

Full objects: 4 different token counts

A complete JSON object {"orderId":"ORD-001","value":"shipped"} produces 4 different token counts depending on the model:

| Token count | Models |
| --- | --- |
| 12 tokens | GPT-4, LLaMA |
| 13 tokens | GPT-4o, Claude, DeepSeek |
| 14 tokens | Qwen, Gemma |
| 15 tokens | Mistral |

The same JSON object has four different lengths across the eight tokenizers. Attention patterns, positional encodings, and context-budget impact therefore all vary per model for identical input data.

Finding 2: The Merge Mechanism

The variance in Finding 1 has a specific cause: BPE merging absorbs the opening quote into the field name.

When GPT-4’s tokenizer (cl100k_base) encounters "value":, it produces:

Token 1: "value    (quote + field name = one token)
Token 2: ":        (quote + colon = one token)

Claude’s tokenizer encounters the same string and produces:

Token 1: "         (quote alone)
Token 2: value     (field name alone)
Token 3: ":        (quote + colon)

The structural boundary lives in a different position. On GPT-4, the opening quote is fused with the content. On Claude, it’s separate. The model must learn to decompose the merged token "value into “this is a quote character followed by a field name” rather than treating it as a single semantic unit.

Why this happens

BPE vocabularies are built from training data statistics. If "value appears frequently in the training corpus (it does, because JSON is everywhere in code), the tokenizer learns it as a single merge. Tokenizers trained on different corpora (or with different vocabulary sizes) reach different merge decisions for the same character sequences.
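The frequency argument can be reproduced with a toy BPE trainer. This is a minimal sketch, not any provider’s training pipeline: it assumes the corpus is already split into words, that a leading quote stays attached to the word that follows it, and that ties are broken lexicographically. The two corpora and all function names are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with a single fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} toy corpus."""
    words = {w: list(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, symbols in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += word_freqs[w]
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = {w: apply_merge(s, best) for w, s in words.items()}
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

json_heavy = {'"value': 100, 'value': 5}                 # quoted key dominates
prose_heavy = {'value': 100, 'valid': 80, '"value': 1}   # quoted key is rare

print(encode('"value', train_bpe(json_heavy, 5)))   # ['"value']
print(encode('"value', train_bpe(prose_heavy, 5)))  # ['"', 'value']
```

With the same merge budget, the JSON-heavy corpus spends its last merge fusing the quote into the key, while the prose-heavy corpus spends it on another word and leaves the quote separate.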

This is well-studied for natural language (Liyanage & Yvon, 2601.21665, on post-training tokenizer adaptation). But the implications for structured data are underexplored: when the merge boundary falls on a structural delimiter, the result is a token that conflates syntax and semantics.

The merge pattern at field-to-value boundaries

We tested JSON’s most structurally critical pattern: the complete field-to-value transition "field":"data":

"value":"hello"

GPT-4, GPT-4o, LLaMA, Qwen (4 tokenizers):
  ["value] [":"] [hello] ["]

Claude, DeepSeek, Gemma, Mistral (4 tokenizers):
  ["] [value] [":"] [hello] ["]

On half of all tokenizers, the opening quote is fused into the field name. The model sees a single token where there should be a structural boundary.

For "name":"Alice":

GPT-4, GPT-4o, LLaMA, Qwen, Mistral (5 tokenizers):
  ["name] [":"] [Alice] ["]

Claude, DeepSeek, Gemma (3 tokenizers):
  ["] [name] [":"] [Alice] ["]

Five of eight tokenizers merge "name into a single token. The field boundary is invisible at the token level on these models.

Finding 3: GCF Grammar Merges 88.8% Less

For comparison, we tested all 10 symbols in GCF’s grammar against all 8 tokenizers: 80 individual checks. Each cell below is the token count for that symbol in isolation.

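| Character | Purpose | Claude | GPT-4 | GPT-4o | LLaMA | Qwen | DeepSeek | Gemma | Mistral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \| | Field delimiter | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| @ | Symbol ID prefix | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| < | Edge direction | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ## | Section header | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| \n | Row separator | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| { | Schema open | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| } | Schema close | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| [ | Count open | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ] | Count close | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| , | Schema separator | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |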

80 checks. Zero exceptions. Every GCF structural symbol encodes as exactly 1 token in isolation on every tokenizer.

We then verified that these characters do not merge with adjacent content in typical data. We tested 15 realistic field+value patterns across all 8 tokenizers (120 additional checks). Representative results:

value|pending            → [value][|][pending]          ALL 8 tokenizers
name|Alice               → [name][|][Alice]             ALL 8 tokenizers
orderId|ORD-001          → [orderId][|][ORD][-][001]    ALL 8 tokenizers
userName|john            → [userName][|][john]           ALL 8 tokenizers
email|alice@example.com  → [email][|][alice][@]...      ALL 8 tokenizers
score|95.5               → [score][|][95][.][5]         ALL 8 tokenizers

On real-world eval data (14 field names, 25 values, 2,800 checks per format):

| Format | Boundary merge rate | Cause |
| --- | --- | --- |
| JSON | 8.93% | Field names ("id":, "name":) merge on 62.5% of tokenizers |
| GCF | 1.00% | One value (cancelled) triggers merge on 25% of tokenizers |

GCF has 88.8% fewer boundary merges on real data.

The critical difference: JSON’s merges are caused by field names, which repeat on every row and therefore compound at scale. GCF’s merges are caused by a single value that appears only occasionally. At 500 rows with "id" and "name" fields, JSON has 1,000 hidden boundaries on each of the five tokenizers that merge them, or ~625 averaged across all eight. GCF has a handful.
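The headline percentages follow from simple counting. The sketch below reproduces them under one assumption that the text implies but does not spell out: JSON’s merges come from two field names ("id", "name") on 5 of 8 tokenizers for every value, and GCF’s merges come from one value on 2 of 8 tokenizers for every field name.

```python
TOKENIZERS = 8
FIELDS, VALUES = 14, 25
checks = FIELDS * VALUES * TOKENIZERS          # checks per format

# Assumed decomposition of the merge counts (inferred, not stated in the text).
json_merges = 2 * VALUES * 5                   # "id", "name" on 5 of 8 tokenizers
gcf_merges = 1 * FIELDS * 2                    # one value on 2 of 8 tokenizers

json_rate = json_merges / checks
gcf_rate = gcf_merges / checks
reduction = 1 - gcf_rate / json_rate

# 500 rows x 2 merging fields, averaged over all 8 tokenizers (5 of which merge).
avg_hidden_boundaries = 500 * 2 * 5 / TOKENIZERS

print(checks)                                  # 2800
print(round(json_rate * 100, 2))               # 8.93
print(round(gcf_rate * 100, 2))                # 1.0
print(round(reduction * 100, 1))               # 88.8
print(avg_hidden_boundaries)                   # 625.0
```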

Under adversarial conditions (values starting with /, ., -), GCF’s pipe can merge on some tokenizers (e.g., [|/] on LLaMA and Gemma). Even then, the pipe is the first character of the merged token, so the boundary position is still identifiable. In JSON, the quote is also first, but the token then carries the entire field name (["value]), so the delimiter and the content it delimits become a single unit.

No delimiter is perfect against all possible right-contexts. But GCF’s grammar characters merge at significantly lower rates than JSON’s, and when they do merge, the boundary remains at the token start rather than hidden inside.

Why GCF’s delimiters are safe

This isn’t accidental. We analyzed all 94 printable ASCII characters (codes 33-126) across all 8 tokenizers on two criteria:

  1. Does it encode as exactly 1 token in isolation?
  2. Does it never merge with adjacent text?

74 of 94 characters satisfy both criteria. The 20 characters that fail include . (merges into .validate), - (merges into -based), _ (merges into _name), and common lowercase letters. These are characters that routinely appear inside JSON keys and values (dots in qualified names, underscores in field names, dashes in UUIDs).

GCF’s grammar was designed using only characters from the safe set. The format’s tokenization stability is a deliberate design choice, not a lucky accident.

Finding 4: The Root Cause Is in the Vocabulary

Everything above describes WHAT happens (merging) and WHERE (which fields, which models). This finding explains WHY and proves it’s irrecoverable.

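BPE tokenizers have a fixed vocabulary: a lookup table mapping strings to integer IDs, together with the ordered merge rules that produced them. Encoding is deterministic: the tokenizer applies those merges to the input, and for a high-frequency sequence the result is the longest matching vocabulary entry. If "name exists as entry #32586, it is selected as a single token whenever that sequence appears. The decision depends only on the adjacent characters, not on the prompt, the task, or anything the model has learned. It’s a dictionary lookup.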
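The effect of a single vocabulary entry can be shown with a simplified longest-match tokenizer. This is a sketch under stated assumptions: production BPE encoders replay ranked merges rather than doing a literal longest-match scan, and the two toy vocabularies here are hypothetical. Only the presence or absence of the "name entry differs between them.

```python
def longest_match(text, vocab):
    """Greedy longest-match tokenization against a set of vocabulary strings."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # unknown character: emit as-is
            i += 1
    return tokens

# Hypothetical toy vocabularies; they differ only in the fused '"name' entry.
clean_vocab = {'"', ':', '":"', 'name', 'Alice'}
merged_vocab = clean_vocab | {'"name'}

text = '"name":"Alice"'
print(longest_match(text, merged_vocab))  # ['"name', '":"', 'Alice', '"']
print(longest_match(text, clean_vocab))   # ['"', 'name', '":"', 'Alice', '"']
```

The two outputs match the merged and unmerged sequences reported for "name":"Alice" in Finding 2: once the fused entry exists, nothing in the input can prevent it from being chosen.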

We scanned every entry in all 8 tokenizer vocabularies:

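| Tokenizer | Vocab size | Quote+letter entries | Pipe+letter entries | Ratio |
| --- | --- | --- | --- | --- |
| GPT-4 (cl100k) | ~100K | 114 | 17 | 6.7:1 |
| GPT-4o (o200k) | ~200K | 86 | 6 | 14.3:1 |
| Claude | ~65K | 0 | 0 | clean |
| LLaMA 3.1 | ~128K | 114 | 18 | 6.3:1 |
| Qwen 2.5 | ~131K | 114 | 17 | 6.7:1 |
| DeepSeek V3 | ~128K | 42 | 4 | 10.5:1 |
| Gemma 2 | ~256K | 0 | 0 | clean |
| Mistral Nemo | ~131K | 31 | 3 | 10.3:1 |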

GPT-4 has 114 vocabulary entries where a quote is fused with a following word. Claude and Gemma have zero. This is why Claude handles JSON boundaries cleanly and GPT-4 doesn’t: the merged token literally does not exist in Claude’s dictionary.
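A scan of this kind is a pattern match over every vocabulary string. The sketch below shows the idea on a hypothetical nine-entry vocabulary; the regular expressions are one plausible reading of “quote+letter” and “pipe+letter” and may differ from what the repository scripts use.

```python
import re

# Assumed definitions: an entry that starts with the delimiter followed by an ASCII letter.
QUOTE_LETTER = re.compile(r'^"[A-Za-z]')
PIPE_LETTER = re.compile(r'^\|[A-Za-z]')

def scan_vocabulary(entries):
    """Collect vocabulary entries that fuse a delimiter with a following letter."""
    return {
        "quote_letter": [e for e in entries if QUOTE_LETTER.match(e)],
        "pipe_letter": [e for e in entries if PIPE_LETTER.match(e)],
    }

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ['"', '":', '"name', '"id', '"value', 'name', 'value', '|', '|null']

result = scan_vocabulary(toy_vocab)
print(result["quote_letter"])  # ['"name', '"id', '"value']
print(result["pipe_letter"])   # ['|null']
```

Entries made only of punctuation, such as ":, are not counted, since they do not fuse a delimiter with content.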

Actual token IDs

These are not hypothetical. These are dictionary entries with specific IDs:

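| Field | GPT-4 | GPT-4o | LLaMA | Qwen | Claude | Gemma |
| --- | --- | --- | --- | --- | --- | --- |
| "name | #32586 | #74800 | #32586 | #31486 | none | none |
| "id | #29800 | #60094 | #29800 | #28700 | none | none |
| "type | #45570 | #91290 | #45570 | #44470 | none | none |
| "value | #64407 | #180654 | #64407 | #63307 | none | none |
| "title | #83827 | #187286 | #83827 | #82727 | none | none |
| "description | #69093 | #150676 | #69093 | #67993 | none | none |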
"description#69093#150676#69093#67993

Cross-verified: we encoded "name":"Alice" and confirmed token #32586 appears in GPT-4’s output. The entries are active, not dead vocabulary.

Why these entries exist

JSON is one of the most common data formats in LLM training corpora. Every GitHub repo has package.json. Every API doc shows JSON examples. Every Stack Overflow answer demonstrates JSON parsing. The byte sequence "name appeared billions of times in training data, so the tokenizer learned it as a high-frequency merge and added it to the vocabulary.

This is efficient for compression (fewer tokens for common patterns). But it creates structural ambiguity: the grammar symbol (") and the payload content (name) become one token, and the model cannot see inside a token to decompose it.

The training familiarity paradox

The conventional wisdom is that LLMs “know” JSON best because they’ve been trained on more JSON than any other structured format. This is true at the model level: the transformer weights have learned JSON’s semantics from billions of examples. But at the tokenizer level, the opposite happens: the more JSON the tokenizer saw during training, the more aggressively it merged JSON patterns, and the more structural boundaries it hid.

The models that saw the MOST JSON have the WORST JSON boundaries:

  • GPT-4 (massive code corpus): 114 merged quote+field entries
  • LLaMA (large code mix): 114 merged entries
  • Claude (different tokenizer strategy): 0 merged entries

The training familiarity didn’t create structural understanding. It created compression. The tokenizer optimized for representing JSON in fewer tokens, which is exactly what a compression algorithm should do. But compression hides structure. The quote and the field name became one token because that’s more efficient for storage. It’s less efficient for comprehension.

This inverts the standard argument entirely. “Trained on JSON” is not an advantage for structural comprehension at scale. It’s the mechanism that causes structural ambiguity. The tokenizer’s efficiency is the model’s handicap.

Why Claude and Gemma don’t have this problem

Claude’s tokenizer has zero quote+letter entries. Gemma’s has zero. The specific tokenizer training details are proprietary, but measurable differences explain the divergence:

  • Vocabulary size: Claude uses ~65K entries (smallest tested). Smaller vocabularies are more conservative about which merges to include. GPT-4’s 100K has budget for specialized merges like "name.
  • Training data mix: Less code/JSON in the training corpus means "name appears less frequently, making it less likely to cross the merge threshold.
  • Merge boundary policy: BPE training can be configured to treat certain characters as merge barriers. Anthropic and Google may have prevented " from merging with adjacent letters.

Gemma’s vocabulary is the largest (256K) yet has zero quote merges. Larger vocabulary doesn’t mean more merges. The merge policy matters more.

Why this is irrecoverable

  1. Vocabulary is frozen. Once the tokenizer is trained, entries never change. Fine-tuning adjusts weights, not vocabulary.
  2. All weights depend on the vocabulary. Token #32586 has a learned embedding. Removing it would break every layer.
  3. Tokenization is pre-model. The merge happens before the transformer processes the input. The model receives integer IDs, not characters.
  4. Retraining the tokenizer requires retraining the model. New vocabulary means new embeddings, new attention patterns. Full retrain from scratch.

No amount of prompt engineering, fine-tuning, or RLHF can fix this. The structural boundary between " and name is invisible to GPT-4 because token #32586 exists in its dictionary. It will always exist. The only fix is a format whose grammar characters don’t appear as merged entries in tokenizer vocabularies.

What about GCF’s pipe?

The pipe has a small number of merged entries (17 on GPT-4), but they’re with programming keywords (|null, |string, |max, |min, |required) from TypeScript/Go type union syntax. |name, |id, |type, |value never exist as vocabulary entries on any tokenizer. The pipe merges with type-system keywords, not with the field names that matter for structured data.

Finding 5: JSON Overhead is 81%, Growing Linearly

Beyond structural ambiguity, JSON also burns the majority of its tokens on non-data content. We measured where tokens go in a 500-row frequency table (4 fields: field, value, count, percentage):

JSON token distribution (500 rows, GPT-4o tokenizer)

| Category | Tokens | % of total | Growth |
| --- | --- | --- | --- |
| Repeated field names ("field":, "value":, etc.) | 5,500 | 52.4% | Linear (11 per row) |
| Structural characters ({, }, [, ], :, ,) | 3,001 | 28.6% | Linear (6 per row) |
| Actual data values | 1,995 | 19.0% | Linear (content-dependent) |
| Total | 10,496 | | |

81% of JSON’s tokens carry zero new information after the first row. The field names "field":, "value":, "count":, "percentage": are declared on row 1 and then repeated identically 499 more times.

GCF token distribution (same data)

| Category | Tokens | % of total | Growth |
| --- | --- | --- | --- |
| Header (field names, declared once) | 10 | 0.2% | Constant |
| Data rows | 6,500 | 99.8% | Linear (content only) |
| Total | 6,510 | | |

GCF declares field names once in the header (## [500]{field,value,count,percentage}), then emits rows with zero structural repetition. The ratio of useful-to-total tokens is 99.8%.

Per-field cost analysis

Each JSON field-name pattern costs tokens on every row:

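| Field pattern | Tokens per occurrence | × 500 rows | Total cost |
| --- | --- | --- | --- |
| "field": | 3 | × 500 | 1,500 |
| "value": | 2 (GPT-4o) to 3 (Claude) | × 500 | 1,000-1,500 |
| "count": | 3 | × 500 | 1,500 |
| "percentage": | 3 | × 500 | 1,500 |
| Total per row | 11 | × 500 | 5,500 |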

In GCF, all four field names cost 10 tokens total (once, in the header).

The ratio at scale

| Rows | JSON overhead (field names + structural) | GCF overhead (header) | Ratio |
| --- | --- | --- | --- |
| 10 | 171 tokens | 10 tokens | 17:1 |
| 50 | 851 tokens | 10 tokens | 85:1 |
| 100 | 1,701 tokens | 10 tokens | 170:1 |
| 500 | 8,501 tokens | 10 tokens | 850:1 |
| 1,000 | 17,001 tokens | 11 tokens | 1,545:1 |

At 1,000 rows, JSON burns 17,001 tokens on structural overhead. GCF uses 11. The gap grows without bound because JSON’s overhead is O(n) in the number of rows, while GCF’s header cost is effectively constant (10 tokens up to 500 rows, 11 at 1,000).
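The table is consistent with a simple linear model: 11 field-name tokens plus 6 structural tokens per row, plus one fixed token. The fixed token is an assumption chosen to match the reported totals (for example 8,501 at 500 rows); the GCF header sizes are copied from the table rather than computed.

```python
FIELD_NAME_TOKENS_PER_ROW = 11   # from the per-field cost table (GPT-4o)
STRUCTURAL_TOKENS_PER_ROW = 6    # from the token distribution table
FIXED_TOKENS = 1                 # assumed constant that matches the reported totals

# GCF header sizes as reported in the table (not derived here).
GCF_HEADER_TOKENS = {10: 10, 50: 10, 100: 10, 500: 10, 1000: 11}

def json_overhead(rows):
    """Non-data tokens JSON spends on `rows` records under the linear model."""
    per_row = FIELD_NAME_TOKENS_PER_ROW + STRUCTURAL_TOKENS_PER_ROW
    return per_row * rows + FIXED_TOKENS

ratios = []
for rows, header in GCF_HEADER_TOKENS.items():
    overhead = json_overhead(rows)
    ratios.append(overhead // header)      # the table reports truncated ratios
    print(rows, overhead, header, f"{overhead // header}:1")
```

Doubling the row count doubles JSON’s overhead, while the header term barely moves, which is why the ratio column grows roughly in proportion to the number of rows.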

Finding 6: Cross-Tokenizer Consistency

The overhead pattern is not an artifact of one tokenizer. All 8 confirm it:

| Tokenizer | JSON tokens | GCF tokens | Savings | JSON field-name overhead |
| --- | --- | --- | --- | --- |
| Claude (Anthropic) | 10,996 | 7,013 | 36.2% | 54.6% |
| GPT-4 (OpenAI cl100k) | 10,494 | 6,508 | 38.0% | 52.4% |
| GPT-4o (OpenAI o200k) | 10,494 | 6,508 | 38.0% | 52.4% |
| LLaMA 3.1 (Meta) | 10,494 | 6,508 | 38.0% | 52.4% |
| Qwen 2.5 (Alibaba) | 13,150 | 9,166 | 30.3% | 41.8% |
| DeepSeek V3 | 10,494 | 6,509 | 38.0% | 57.2% |
| Gemma 2 (Google) | 14,149 | 9,669 | 31.7% | 42.4% |
| Mistral Nemo | 13,649 | 9,167 | 32.8% | 44.0% |

Every tokenizer shows JSON spending 42-57% of its total tokens on repeated field names. The absolute numbers vary (Gemma uses more tokens overall due to smaller subword merges), but the proportional waste is consistent.
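The savings column can be recomputed from the two token counts as 1 minus GCF tokens divided by JSON tokens. The sketch below uses only the numbers reported in the table.

```python
# (JSON tokens, GCF tokens) per tokenizer, copied from the table above.
token_counts = {
    "Claude":       (10996, 7013),
    "GPT-4":        (10494, 6508),
    "GPT-4o":       (10494, 6508),
    "LLaMA 3.1":    (10494, 6508),
    "Qwen 2.5":     (13150, 9166),
    "DeepSeek V3":  (10494, 6509),
    "Gemma 2":      (14149, 9669),
    "Mistral Nemo": (13649, 9167),
}

def savings_percent(json_tokens, gcf_tokens):
    """Percentage of tokens saved by GCF relative to JSON, to one decimal place."""
    return round((1 - gcf_tokens / json_tokens) * 100, 1)

savings = {name: savings_percent(j, g) for name, (j, g) in token_counts.items()}
for name, pct in savings.items():
    print(f"{name}: {pct}%")
```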

The full savings picture

The overhead analysis above uses a flat frequency table. Savings increase with data complexity and session reuse:

| Scenario | GCF vs JSON (pretty) | What drives it |
| --- | --- | --- |
| Generic profile (flat/nested, 500 orders) | 50-59% | Header factorization, inline schemas |
| 15-dataset benchmark (mixed real payloads) | 43-65% | Data complexity determines savings |
| Graph profile (500 symbols + 200 edges) | 63-69% | @id refs, edge encoding, section headers |
| Session dedup (90% overlap, call 3 of 5) | 89-90% | Bare references for previously-seen symbols |
| Session dedup (full 5-call session total) | 84.3% | Format + dedup combined |

In a real agent session with repeated tool calls to the same codebase, cumulative savings reach 84-92%. JSON has no deduplication mechanism; every call retransmits the full payload. GCF’s bare references (@7 = 2 tokens vs full declaration = 19 tokens) mean subsequent calls cost a fraction of the first.

All numbers are cross-validated across 8 tokenizers from 7 providers.

Why Merging Causes Higher Degradation at Scale

At 10 rows, "name being one token instead of two doesn’t matter. The model has seen enough JSON to handle it. There are only 10 merged boundaries. The attention mechanism can work around it.

At 500 rows, three problems compound simultaneously:

1. The merged boundary repeats 500 times. Each row contains "name":, "id":, "type":. That’s ~1,500 positions where the structural boundary is inside a merged token. The model must decompose structure from inside merged tokens at 1,500 positions, not 10.

2. All 1,500 positions are identical token sequences. The token for "name on row 1 is the same integer (#32586) as on row 500. The model can’t distinguish them. It relies on positional encoding alone to track “which "name am I looking at?” Positional encoding degrades over long sequences.

3. 81% of the sequence is noise. The repeated field names and braces are not just merged; they’re also redundant. The attention mechanism is spread across ~8,500 tokens that carry no information, trying to find the ~2,000 that do. The merged boundaries make the noise harder to skip because the model can’t cleanly identify where structure ends and data begins.

The compounding is the critical insight. At 10 rows: manageable. At 500 rows: 1,500 merged boundaries, massive noise, positional encoding stretched, attention diluted. The model stops finding precise answers and guesses. This is why JSON errors at scale are off by 50-140 (couldn’t find the answer), not off by 1-2 (slightly misread a number).

GCF at 500 rows: zero merged boundaries on field names, 99.8% signal, structure answers questions directly (## related [167]). Nothing compounds because there’s nothing to compound.

Attention dilution in detail

Self-attention allocates a fixed budget across all input positions. When 80% of the sequence is structural noise, the budget is diluted across positions that carry no information.

Ildiz et al. (2402.13512) proved mathematically that self-attention weights tokens proportionally to their frequency in the sequence. Their Context-Conditioned Markov Chain (CCMC) formulation shows that the probability of attending to token j includes m_j (the count of token j in the sequence) in the numerator. Tokens that appear more often receive more attention weight purely by count, not by relevance. In a 500-row JSON array, structural tokens ("name":, "id":, {, }, :) account for ~80% of all token occurrences. The CCMC formula means these tokens dominate the attention budget, and the actual data values (semantically important but numerically outnumbered) receive proportionally less. This is not a hypothesis; it’s a mathematical property of the self-attention mechanism. (The paper analyzes single-layer models; multi-layer architectures partially mitigate this, but our comprehension data shows the mitigation is insufficient at 500+ rows.)
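The count effect is easy to see in a plain softmax, independent of the paper’s formulation. The sketch below is not the CCMC model; it is a minimal illustration that assumes every token in a group shares one attention logit, and it uses the 8,501 overhead and 1,995 data token counts from Finding 5.

```python
import math

def attention_mass(groups):
    """Softmax mass per group, where groups maps name -> (token_count, shared_logit)."""
    weights = {name: count * math.exp(logit) for name, (count, logit) in groups.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Equal relevance scores: attention mass simply follows token counts.
equal = attention_mass({"structure": (8501, 0.0), "data": (1995, 0.0)})
print(round(equal["data"], 3))   # 0.19

# Logit advantage the data tokens would need just to reach half of the mass.
needed = math.log(8501 / 1995)
balanced = attention_mass({"structure": (8501, 0.0), "data": (1995, needed)})
print(round(needed, 2), round(balanced["data"], 3))   # 1.45 0.5
```

Under these assumptions, a query has to score every data token about 1.45 logits above every structural token merely to split attention evenly; with equal scores, data receives the same 19% share it has of the sequence.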

Consider the task “how many records have status = shipped?” given 500 JSON objects. The model must attend to every "status": pattern (500 occurrences), read the following value, compare to “shipped,” and count. The 500 "status": patterns produce the same tokens every time. The model has no structural marker distinguishing the 150th from the 350th. It relies on positional encoding alone.

In GCF, the equivalent task requires attending to a column of values with pipe delimiters at known, consistent positions. No ambiguity. No repetition competing for attention.

Counter-Argument: Training Distribution

The strongest counter-argument comes from Kutschka & Geiger (2605.29676) and Matveev (2603.03306): models have seen enormous amounts of JSON during training. This familiarity might compensate for structural inefficiency. JSON is overrepresented in code corpora, and models may have internalized parsing logic that handles merged tokens correctly.

Our response, supported by our evaluation data:

  1. Training familiarity helps at small scale. All formats achieve near-100% accuracy at 10-50 records. The model has seen enough JSON to parse small payloads easily.

  2. Familiarity fails at scale. At 500 records with complex structure, JSON drops to 53.4% accuracy across 10 models. No amount of training data exposure compensates for the attention dilution of 8,000+ noise tokens.

  3. GCF achieves 100% with zero training exposure. To our knowledge, no model has been trained on GCF. Yet every frontier model we tested comprehends it perfectly on standard workloads and achieves 91.2% on structurally complex data. This indicates that format structure matters more than training distribution once you’re past the complexity threshold.

  4. The threshold is lower than people think. Matveev’s “scaling hypothesis” suggests formats only separate past a complexity threshold. Our data shows that threshold is ~100-200 records for nested data and ~500 for flat tables. Most production tool responses exceed this.

The Structural Variance Hypothesis

This analysis suggests a testable hypothesis for the model-dependent failures we observe:

| Model | JSON accuracy (stress) | Tokenizer merges "value? | Prediction |
| --- | --- | --- | --- |
| Claude Opus 4.6 | 73.1% | No (3 separate tokens) | Better JSON performance |
| Claude Sonnet 4.6 | 53.8% | No | Smaller model, attention budget matters more |
| GPT-5.5 | 45.8% | Likely (consistent with o200k patterns) | Merged boundaries hurt at scale |
| GPT-5.4 | 44.1% | Likely (deterministic errors match o200k merges) | Always 198 vs 200 |
| Gemini 2.5 Pro | 58.3% | No (Gemma tokenizer) | Better than GPT but still fails on counting |

The pattern is suggestive: models whose tokenizers merge field-name patterns tend to perform worse on JSON comprehension at scale. This is not proof of causation (model capability differences are a confound), but it’s consistent with the structural ambiguity mechanism.

On GCF, all of these models achieve 85-100%. The tokenizer differences don’t matter because GCF’s delimiters are always unambiguous.

Implications

For format designers

Choose structural delimiters from the “never-merge” character set. Our ASCII space analysis found 74 safe characters and 20 unsafe ones. The unsafe set includes characters that routinely appear inside JSON keys and values:

  • . (dot): merges into .validate, .com, .json
  • - (dash): merges into -based, -style, -token
  • _ (underscore): merges into _name, _id, _count
  • Lowercase letters: merge into subword prefixes

Design principle: a format’s structural characters should be chosen from the set of characters with the lowest merge rates across tokenizers. No character is perfect against all possible adjacent content, but the difference between JSON’s 8.93% merge rate and GCF’s 1.00% is the difference between 1,500 hidden boundaries at scale and a handful.

For tool builders

If your MCP server or AI tool outputs JSON arrays to LLMs:

  • At 10 records: JSON is fine. All formats work.
  • At 100 records: consider alternatives. JSON overhead is 1,701 tokens of noise.
  • At 500+ records: JSON’s comprehension failures are measurable. 53.4% accuracy means your agent gets the wrong answer nearly half the time.

The fix is straightforward: declare field names once, emit positional rows, use non-merging delimiters. This is what GCF does.
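The three steps can be sketched in a few lines. This is not the GCF encoder and does not implement the GCF specification: it only imitates the header shape quoted earlier (## [500]{field,value,count,percentage}) and the pipe delimiter, and it assumes flat records with identical keys whose values contain no pipe or newline characters. The function name is hypothetical.

```python
def to_header_once(records):
    """Encode flat, uniform records as one header line plus positional rows.

    Illustrative only: no escaping, no nesting, no GCF-specific features.
    """
    if not records:
        raise ValueError("need at least one record")
    fields = list(records[0])                       # field order from the first record
    header = "## [{}]{{{}}}".format(len(records), ",".join(fields))
    rows = ["|".join(str(rec[name]) for name in fields) for rec in records]
    return "\n".join([header] + rows)

orders = [
    {"id": 1, "status": "shipped"},
    {"id": 2, "status": "pending"},
]
print(to_header_once(orders))
# ## [2]{id,status}
# 1|shipped
# 2|pending
```

Field names appear once regardless of how many rows follow, so the per-row cost is the values plus one delimiter between them.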

For agent architects

If you’re building multi-model systems:

  • JSON produces different token sequences on different models for the same data
  • This means each model sees a structurally different representation of the same input
  • For consistency-critical applications (multi-model voting, fallback chains), this variance matters
  • GCF produces identical structural boundaries on every model, eliminating this variable

For researchers

This analysis opens several directions:

  1. Causal testing: Does artificially introducing merged tokens at field boundaries directly cause comprehension errors? (Controlled experiment with custom tokenizer)
  2. Attention visualization: Do attention maps show different patterns at merged vs. separate boundary tokens?
  3. Format-aware fine-tuning: Can models be fine-tuned to handle merged boundary tokens better? (And is that easier or harder than just using a better format?)
  4. Optimal grammar search: Given a tokenizer vocabulary, what is the mathematically optimal delimiter set? (Minimize total tokens while maximizing boundary consistency)

Related work

| Paper | Key contribution | Relation to our findings |
| --- | --- | --- |
| Deekeswar, “ONTO” (2604.17512) | 1,000 JSON records = ~80K tokens, majority overhead | Quantifies the problem we explain mechanistically |
| Karim & Batatia (2508.01685) | Innovative tokenisation of structured data for LLM training | Hybrid approach; GCF achieves similar result via grammar design |
| Kutschka & Geiger (2605.29676) | Notation Matters: token-optimized format benchmark | We show GCF avoids the accuracy tradeoff via unambiguous delimiters |
| Ildiz et al. (2402.13512) | Self-attention weights tokens proportionally to frequency (CCMC proof) | Mathematical basis: repeated structural tokens dominate attention budget by count |
| Sui et al. (2305.13062) | Table format affects LLM performance | General finding we explain at the BPE level |
| Matveev (2603.03306) | TOON vs JSON benchmark with constrained decoding | Confirmed: formats separate past ~200 records |

Reproduce

All experiments are reproducible from a single repository:

git clone https://github.com/blackwell-systems/gcf
cd gcf

# Structural variance benchmark (8 tokenizers, merge analysis)
node eval/structural-variance.mjs

# Common business field analysis (155 fields, 15 worst offenders)
node eval/common-field-merge-analysis.mjs

# Worst-case JSON tokenization search (maximum variance patterns)
node eval/worst-json-tokenization.mjs

# JSON overhead analysis (token distribution, scaling)
node eval/json-tokenization-analysis.mjs

# Full tokenizer variance analysis (8 tokenizers, multiple scales)
node eval/tokenizer-variance.mjs

# Vocabulary entry analysis (root cause: merged tokens are dictionary entries)
node eval/tokenizer-vocabulary-analysis.mjs

# Full vocabulary scan (exhaustive: every entry in every vocabulary)
node eval/vocabulary-full-scan.mjs

# Grammar swap experiment (proves savings are structural, not delimiter-specific)
node eval/grammar-swap-experiment.mjs

The comprehension evaluation (2,400+ LLM calls testing whether these findings correlate with actual model behavior) is at github.com/blackwell-systems/gcf-go/tree/main/eval.

Summary

| Metric | JSON | GCF |
| --- | --- | --- |
| Boundary merge rate (real eval data) | 8.93% (2,800 checks) | 1.00% (2,800 checks) |
| Merge cause | Field names (repeat per row) | Rare value (occasional) |
| Worst single-field merge rate | 62.5% ("id":, "name":) | 25% (one value: cancelled) |
| Quote+letter vocabulary entries (GPT-4) | 114 | n/a |
| Pipe+letter vocabulary entries (GPT-4) | n/a | 17 (none are field names) |
| Root cause | Hardcoded vocab entries ("name=#32586) | No field-name merges in any vocab |
| Fixable? | No (frozen vocabulary, all weights depend on it) | N/A |
| Tokens spent on overhead (500 rows) | 81% | 0.2% |
| Overhead scaling | O(n) per row | O(1) constant |
| Signal-to-noise ratio | 19% signal | 99.8% signal |
| Comprehension at 500 records (stress) | 53.4% | 91.2% |
| Comprehension on standard workloads | 100% (frontier) | 100% (frontier) |

JSON was designed in 2001 for human-readable data interchange between web browsers and servers. Its structural choices (quotes around keys, colons as separators, repeated field names) made sense for that era. They predate BPE tokenizers by over a decade. They predate transformer attention by 16 years.

The tokenization analysis shows the problem isn’t just token count. It’s structural ambiguity. Different models see different boundaries. The attention mechanism is diluted by repetition. And both problems compound linearly with data size.

For LLM systems processing structured data at scale, the format matters. Not just for cost, but for correctness.