Mcp

Why LLMs Struggle with JSON at Scale: A Tokenization Analysis 2026-06-21

We benchmarked 8 tokenizers from 6 providers and performed exhaustive vocabulary scans. 15 of the most common JSON field names (id, name, type, value, title, time, text, url, path, description) exist as merged vocabulary entries (quote+field in one token) on GPT-4 (#32586 for ‘“name’), LLaMA, and Qwen. Claude and Gemma have zero such entries. These merges are hardcoded, deterministic, and irrecoverable without retraining the tokenizer. On real eval data: JSON boundary merge rate 8.93% vs GCF 1.00% (88.8% fewer). GPT-4 has 114 quote+letter vocabulary entries vs 17 pipe+letter (6.7:1 ratio). JSON overhead is 81% at scale. This structural ambiguity compounds per row and explains model-dependent comprehension failures across 2,400+ evaluations.

LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write? 2026-06-06

23 comprehension runs across 10 models (Claude Opus/Sonnet/Haiku, GPT-5.5/5.4/5.4-mini, Gemini 2.5 Flash/Pro, Gemini 3.1 Pro, Gemini 3.5 Flash). Generation eval across 11 models and 3 providers (Anthropic, OpenAI, Google). GCF wins 22, ties 1, loses 0 on comprehension. GCF achieves 5/5 valid generation on every frontier model with zero prior training. TOON fails 0/5 on generation with Opus, GPT-5.4, GPT-5.4-mini, Gemini 3.1 Pro, and Gemini 3.1 Flash Lite. JSON breaks on input at 500 symbols.

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them. 2026-06-03

codegraph has 19K GitHub stars. GitNexus has 40K. Aider has 20K. We benchmarked 7 systems on 302 tasks across 17 codebases, 8 languages. knowing is 3.79x more precise than codegraph, 6.00x vs GitNexus, 6.35x vs Gortex, 22.0x vs grep. 13 self-adapting mechanisms that compound over time.

We Ran TOON's Own Benchmark. GCF Won. 2026-06-03

We inserted GCF into TOON’s benchmark harness. Same datasets, same tokenizer, same methodology. GCF uses 34% fewer tokens on mixed-structure data, matches TOON on flat data, and achieves 100% LLM comprehension accuracy where JSON fails at 66.7%.

Your AI Agent's Code Search Hits 2% of the Time. We Benchmarked It. 2026-05-22

Rigorous benchmark of AI agent code retrieval: 107 tasks, 5 repos, 5 languages, 4 competitors. grep precision: 2%. GitNexus: 7.6%. knowing: 23% (11.5x better, p<0.0001). Plus: 193x faster indexing, 28x less RAM, 48x more token-efficient than Repomix. The first statistically validated comparison of code intelligence tools for AI agents.

The Code Intelligence Landscape: Context, Memory, and Proofs 2026-05-20

AI coding agents have a context problem. The tools solving it fall into four categories: context packers, code graphs, memory systems, and runtime observability. Each solves one piece. None versions the intelligence. None proves anything. None learns without poisoning itself over time. This article explores the landscape and argues that content-addressed code graphs with cryptographic proofs are the missing foundation.

What Git Did for Files, Applied to Code Relationships 2026-05-20

Git proved that content-addressing file contents gives you integrity, history, efficient equality, and distributed collaboration for free. The same architecture applied to code relationships gives you something new: versioned intelligence that you can diff, cache, prove, and trust over time.

We Measured It: LSP Saves AI Agents 5-34x Tokens vs Grep 2026-05-03

We built a reproducible experiment measuring how many tokens AI coding agents consume when navigating code with grep vs LSP. On HashiCorp Consul (319K lines), LSP uses 34x fewer tokens. On a TypeScript rename across 24 files: 1,441x fewer bytes. The experiment covers 4 codebases, 3 languages, 13 tasks covering 7 agent workflows.

We Tested 55 MCP Servers. Here's What Breaks. 2026-04-27

MCP servers are the tools AI agents rely on. We tested 55 of them with mcp-assert, found 20 bugs across 9 servers, and submitted fix PRs. Grafana and Ant Group merged ours. Three days after launch, Ant Group’s visualization team asked us to integrate mcp-assert into their CI. The most common failure: servers throw unhandled exceptions instead of returning isError, leaving agents unable to recover.

agent-lsp: Reliable Code Intelligence for AI Agents via MCP and LSP 2026-04-15

I needed AI agents to reliably rename symbols, find references, and check diagnostics without silent failures. The existing MCP-LSP tools were stateless, feature-poor, and untested. So I built agent-lsp: a persistent runtime with 50 tools, 20 provider-agnostic skills, speculative execution, and an audit trail for every AI-driven edit.