Benchmarks

LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write? 2026-06-06

23 comprehension runs across 10 models (Claude Opus/Sonnet/Haiku, GPT-5.5/5.4/5.4-mini, Gemini 2.5 Flash/Pro, Gemini 3.1 Pro, Gemini 3.5 Flash). Generation eval across 11 models and 3 providers (Anthropic, OpenAI, Google). GCF wins 22, ties 1, loses 0 on comprehension. GCF achieves 5/5 valid generation on every frontier model with zero prior training. TOON fails 0/5 on generation with Opus, GPT-5.4, GPT-5.4-mini, Gemini 3.1 Pro, and Gemini 3.1 Flash Lite. JSON breaks on input at 500 symbols.

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them. 2026-06-03

codegraph has 19K GitHub stars. GitNexus has 40K. Aider has 20K. We benchmarked 7 systems on 302 tasks across 17 codebases, 8 languages. knowing is 3.79x more precise than codegraph, 6.00x vs GitNexus, 6.35x vs Gortex, 22.0x vs grep. 13 self-adapting mechanisms that compound over time.

We Ran TOON's Own Benchmark. GCF Won. 2026-06-03

We inserted GCF into TOON’s benchmark harness. Same datasets, same tokenizer, same methodology. GCF uses 34% fewer tokens on mixed-structure data, matches TOON on flat data, and achieves 100% LLM comprehension accuracy where JSON fails at 66.7%.

Your AI Agent's Code Search Hits 2% of the Time. We Benchmarked It. 2026-05-22

Rigorous benchmark of AI agent code retrieval: 107 tasks, 5 repos, 5 languages, 4 competitors. grep precision: 2%. GitNexus: 7.6%. knowing: 23% (11.5x better, p<0.0001). Plus: 193x faster indexing, 28x less RAM, 48x more token-efficient than Repomix. The first statistically validated comparison of code intelligence tools for AI agents.

We Measured It: LSP Saves AI Agents 5-34x Tokens vs Grep 2026-05-03

We built a reproducible experiment measuring how many tokens AI coding agents consume when navigating code with grep vs LSP. On HashiCorp Consul (319K lines), LSP uses 34x fewer tokens. On a TypeScript rename across 24 files: 1,441x fewer bytes. The experiment covers 4 codebases, 3 languages, 13 tasks covering 7 agent workflows.