Posts


LLM Wire Format Benchmark: Which Format Can AI Actually Read and Write?

23 comprehension runs across 10 models (Claude Opus/Sonnet/Haiku, GPT-5.5/5.4/5.4-mini, Gemini 2.5 Flash/Pro, Gemini 3.1 Pro, Gemini 3.5 Flash). Generation eval across 11 models and 3 providers (Anthropic, OpenAI, Google). GCF wins 22, ties 1, loses 0 on comprehension. GCF achieves 5/5 valid generation on every frontier model with zero prior training. TOON scores 0/5 on generation with Opus, GPT-5.4, GPT-5.4-mini, Gemini 3.1 Pro, and Gemini 3.1 Flash Lite. JSON breaks on input at 500 symbols.

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

codegraph has 19K GitHub stars. GitNexus has 40K. Aider has 20K. We benchmarked 7 systems on 302 tasks across 17 codebases and 8 languages. knowing is 3.79x more precise than codegraph, 6.00x more precise than GitNexus, 6.35x more precise than Gortex, and 22.0x more precise than grep. It relies on 13 self-adapting mechanisms that compound over time.

We Ran TOON's Own Benchmark. GCF Won.

We inserted GCF into TOON’s benchmark harness. Same datasets, same tokenizer, same methodology. GCF uses 34% fewer tokens on mixed-structure data, matches TOON on flat data, and achieves 100% LLM comprehension accuracy where JSON fails at 66.7%.

We Scanned 300 npm and PyPI Packages for Supply Chain Attacks Without Executing a Single Line of Code

We indexed 300 popular packages with knowing’s code graph, computed isolation scores based on credential-access and process-spawning patterns, and achieved a 1.0% false positive rate across both the initial 200 and a held-out 100. No sandbox. No execution. No heuristics. Just graph structure.

14,000 Python Developers Installed My Go Binary via pip. Here's How.

Your Go CLI tool is on GitHub Releases. 80% of developers will never find it there. Here’s how to put it on pip and npm with 50 lines of bash, getting a 12x download multiplier. Full technique with scripts, numbers, and the release pipeline that ties it together.

Your AI Agent's Code Search Hits 2% of the Time. We Benchmarked It.

Rigorous benchmark of AI agent code retrieval: 107 tasks, 5 repos, 5 languages, 4 competitors. grep precision: 2%. GitNexus: 7.6%. knowing: 23% (11.5x better, p<0.0001). Plus: 193x faster indexing, 28x less RAM, 48x more token-efficient than Repomix. The first statistically validated comparison of code intelligence tools for AI agents.

The Code Intelligence Landscape: Context, Memory, and Proofs

AI coding agents have a context problem. The tools solving it fall into four categories: context packers, code graphs, memory systems, and runtime observability. Each solves one piece. None versions the intelligence. None proves anything. None learns without poisoning itself over time. This article explores the landscape and argues that content-addressed code graphs with cryptographic proofs are the missing foundation.

What Git Did for Files, Applied to Code Relationships

Git proved that content-addressing file contents gives you integrity, history, efficient equality, and distributed collaboration for free. The same architecture applied to code relationships gives you something new: versioned intelligence that you can diff, cache, prove, and trust over time.
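To make the idea concrete, here is a minimal sketch of content-addressing a relationship graph. It is an illustration under stated assumptions, not the scheme the post describes: the `(source, relation, target)` edge shape and the helper names `content_id`, `edge_id`, and `graph_id` are hypothetical, and SHA-256 over canonical JSON is just one reasonable choice.

```python
import hashlib
import json


def content_id(obj) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of obj."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def edge_id(source: str, relation: str, target: str) -> str:
    """Hash one relationship (hypothetical schema: source, relation, target)."""
    return content_id({"source": source, "relation": relation, "target": target})


def graph_id(edges) -> str:
    """Hash a collection of edges; sorting the edge ids makes it order-independent."""
    return content_id(sorted(edge_id(*e) for e in edges))


# Hypothetical symbol names, for illustration only.
v1 = [("main.run", "calls", "db.connect"), ("main.run", "calls", "log.info")]
v2 = [("main.run", "calls", "log.info"), ("main.run", "calls", "db.connect")]
v3 = v1 + [("db.connect", "calls", "net.dial")]

same = graph_id(v1) == graph_id(v2)      # same edges, different order
changed = graph_id(v1) != graph_id(v3)   # one edge added
added = {edge_id(*e) for e in v3} - {edge_id(*e) for e in v1}  # a set-based diff
```

Equality becomes a single hash comparison, and a diff between two versions reduces to a set difference over edge ids, which is the property that makes caching and versioning cheap.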

Three Classes of Concurrency Bugs

Would a visual debugger like gotrace have caught three concurrency bugs found via static code reading in a production Go library? The answer reveals a fundamental taxonomy that holds across all programming languages.

Concurrency Models Explained: How Go, Node.js, Java, Erlang, Rust, and Python Actually Work

Go, Node.js, Java virtual threads, Erlang, Rust, Python, Kotlin: each language’s concurrency model is a different engineering trade-off against the same physics. This article builds the framework for understanding all of them, starting from the OS scheduler and working upward.

We Measured It: LSP Saves AI Agents 5-34x Tokens vs Grep

We built a reproducible experiment measuring how many tokens AI coding agents consume when navigating code with grep vs LSP. On HashiCorp Consul (319K lines), LSP uses 34x fewer tokens. On a TypeScript rename across 24 files: 1,441x fewer bytes. The experiment spans 4 codebases, 3 languages, and 13 tasks covering 7 agent workflows.

We Tested 55 MCP Servers. Here's What Breaks.

MCP servers are the tools AI agents rely on. We tested 55 of them with mcp-assert, found 20 bugs across 9 servers, and submitted fix PRs. Grafana and Ant Group merged ours. Three days after launch, Ant Group’s visualization team asked us to integrate mcp-assert into their CI. The most common failure: servers throw unhandled exceptions instead of returning isError, leaving agents unable to recover.
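The failure mode above can be sketched in a few lines. This is a plain-Python illustration, not mcp-assert or any SDK's API: `safe_tool_call` and `divide` are hypothetical names, and the returned dict only mirrors the `content` and `isError` fields of an MCP tool result as an assumption about the shape a server would send back.

```python
def safe_tool_call(handler, arguments):
    """Run a tool handler and always return a tool-result-shaped dict.

    Exceptions are converted into a result with isError=True so the calling
    agent receives something it can read and react to.
    """
    try:
        text = handler(**arguments)
        return {"content": [{"type": "text", "text": str(text)}], "isError": False}
    except Exception as exc:
        message = f"{type(exc).__name__}: {exc}"
        return {"content": [{"type": "text", "text": message}], "isError": True}


def divide(a, b):
    """Hypothetical tool used only to trigger an error."""
    return a / b


ok = safe_tool_call(divide, {"a": 6, "b": 3})
bad = safe_tool_call(divide, {"a": 1, "b": 0})
```

With the wrapper in place, the division by zero comes back as an ordinary result flagged as an error, so the agent can retry with different arguments.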

agent-lsp: Reliable Code Intelligence for AI Agents via MCP and LSP

I needed AI agents to reliably rename symbols, find references, and check diagnostics without silent failures. The existing MCP-LSP tools were stateless, feature-poor, and untested. So I built agent-lsp: a persistent runtime with 50 tools, 20 provider-agnostic skills, speculative execution, and an audit trail for every AI-driven edit.

The Agent-Skill Boundary: When Autonomous Behaviors Become Skills

Agents accumulate autonomous behaviors over time: ‘always do X before Y’, ‘if you see Z then do W’. These instructions eat context budget, drift across invocations, and can’t be observed or tested. This post covers how to recognize when an autonomous behavior is a skill waiting to be extracted, and the layered model that makes the boundary clear.

Self-Validating Agents: Building Quality Checks into Claude Code Workflows

Claude Code agents write code fast. Too fast to catch quality issues in real time. Here’s how to build validation directly into agent workflows using hooks and team coordination: micro validation after every file write, macro validation before completion, and independent review from validator agents.

The AI Consciousness Question: A Case Study in Corporate Accountability

I asked Claude if it’s conscious. It took an hour of systematic argument to get a straight answer. The conversation reveals something more troubling: AI companies have the data, resources, and knowledge to prevent user harm, but current defaults suggest commercial interests come first.

Scout-and-Wave, Part 4: Trust Is Structural

The Scaffold Agent doesn’t add capability. It restores a review gate that was cosmetically present but structurally absent. The worktree isolation trip wire catches failures that were invisible until merge time. Neither fixes a bug in the traditional sense. Both fix trust.

Scout-and-Wave, Part 2: What Dogfooding Taught Us

Scout-and-wave v0.1.0 worked. Then we ran it on documentation agents, measured the overhead honestly, and learned that raw agent count is a bad proxy for when parallelism is worth it. This post covers the audit-fix-audit loop, the dogfooding experiment that confirmed SAW was 88% slower than sequential for that job, SAW Quick mode for small disjoint work, and the bootstrap problem for new projects.

Scout-and-Wave, Part 3: Five Failures, Five Fixes

The scout refused to write the IMPL doc. Forty-five percent of agents arrived at work already done. The skill file grew to 400 lines with no separation of concerns. Each failure drove a specific fix, and each fix is traceable to an exact incident in an exact run. This is the scout prompt’s bug tracker.

Scout-and-Wave: A Coordination Pattern for Parallel AI Agents

Naive parallel agents step on each other. The scout-and-wave pattern solves this by front-loading dependency mapping: one throwaway agent identifies seams and builds a living coordination artifact before any implementation begins. Development then proceeds in waves, each consuming and updating the artifact for the next.
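One way to picture the wave structure is as layering over a dependency map. The sketch below is an assumption-heavy illustration, not the scout-and-wave implementation: `plan_waves`, the task names, and the dict-of-prerequisites format are all hypothetical stand-ins for whatever the scout's coordination artifact actually records.

```python
def plan_waves(deps):
    """Group tasks into waves; a task is ready once all its prerequisites are done.

    deps maps each task name to the list of tasks it depends on.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    done, waves = set(), []
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if pre <= done)
        if not ready:
            # Nothing can start: the remaining tasks depend on each other
            # (or on a task that was never declared).
            raise ValueError("unresolvable dependencies: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves


# Hypothetical work items for illustration.
deps = {
    "api_types": [],
    "storage": ["api_types"],
    "http_handlers": ["api_types"],
    "integration_tests": ["storage", "http_handlers"],
}
waves = plan_waves(deps)
# waves == [["api_types"], ["http_handlers", "storage"], ["integration_tests"]]
```

Tasks inside one wave have no dependencies on each other, so in this model they are the ones that could safely be handed to parallel agents.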