Last run: 2026-04-17 21:15 UTC · 4 models
The same 28 factual questions, put to four frontier models. Each model runs twice: once from memory alone ("cold"), once with access to our MCP server as its only tool. Ground truth is computed live from our database at the moment each question is asked.
The best model scored 8% cold — with our MCP server it scored 42%. (Best MCP score across all models: 50%.)
That gap is what an MCP server buys you: turning confident guesswork into live, verifiable data.
Each model answers the 28 questions twice: once with no tools (pure parametric knowledge), once with our MCP server wired up as its only tool. Ground truth is computed at run time by querying our D1 database directly, so the benchmark tests against current data, not a frozen snapshot. Scoring rules:

- Numeric answers: correct on an exact match, or within ±5% for averages and percentages.
- List answers: set overlap, correct when the Jaccard index is ≥ 0.8.
- Free-text answers: graded by Claude Haiku 4.5 as an LLM-as-judge.
- Exclusions: a question drops out of the denominator when its ground-truth query returns empty at run time (e.g. a not-yet-populated sector table), or when the model API fails after 3 retries — so a network blip or a missing data slice doesn't count against a model's accuracy. The scored denominator is shown next to each percentage.

The full question set and scoring code are in the repo.
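The numeric and list rules above can be sketched in a few lines. This is an illustrative reconstruction, not the repo's actual scorer; the function names and the exact tolerance handling are assumptions.

```python
# Sketch of the numeric and list scoring rules described above.
# Assumption: the ±5% tolerance applies only to averages/percentages,
# and is measured relative to the ground-truth value.

def score_numeric(answer: float, truth: float, is_average: bool = False) -> bool:
    """Exact match, or within ±5% of truth for averages/percentages."""
    if answer == truth:
        return True
    if is_average and truth != 0:
        return abs(answer - truth) / abs(truth) <= 0.05
    return False

def score_list(answer: set, truth: set) -> bool:
    """Set overlap: Jaccard index >= 0.8 counts as correct."""
    union = answer | truth
    if not union:  # both empty: trivially a perfect match
        return True
    return len(answer & truth) / len(union) >= 0.8
```

Note that a 4-of-5 list answer ({A, B, C, D} against truth {A, B, C, D, E}) has a Jaccard index of exactly 0.8 and so just clears the bar.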
Sorted by cold-mode accuracy — how well each model does without tools. "MCP" is the same model, same question, but with access to our MCP server.
| # | Model | Cold | With MCP | Delta | Last run |
|---|---|---|---|---|---|
| 1 | grok-4 | 8% (2/26) | 42% (10/24) | +34 pts | Apr 17, 2026 UTC |
| 2 | gpt-5 | 4% (1/26) | 38% (10/26) | +35 pts | Apr 17, 2026 UTC |
| 3 | claude-opus-4-6 | 0% (0/25) | 48% (12/25) | +48 pts | Apr 17, 2026 UTC |
| 4 | claude-sonnet-4-6 | 0% (0/25) | 50% (13/26) | +50 pts | Apr 17, 2026 UTC |
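The delta column appears to be computed from the unrounded per-question fractions rather than the rounded percentages — that is my reading, not something stated in the source, but it would explain why gpt-5 shows +35 even though 38 − 4 = 34:

```python
# gpt-5's row, using the raw correct/total counts from the table.
cold_correct, cold_total = 1, 26    # 1/26 ≈ 3.85%, displayed as 4%
mcp_correct, mcp_total = 10, 26     # 10/26 ≈ 38.46%, displayed as 38%

delta = (mcp_correct / mcp_total - cold_correct / cold_total) * 100
print(round(delta))  # 35 — matches the table's +35 pts
```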
Note: MCP tool fixes were deployed 2026-04-18 (UTC) (commit 203edcf, Worker version e0d54de0). Their effect is not yet measured — the leaderboard above predates this deploy.
Where models fall down without tools. Each cell shows correct/total for that category.
| Model | Aggregate | Chamber & Party | Member-Level | Ticker-Level | Committee-Level |
|---|---|---|---|---|---|
| claude-opus-4-6 | 0% (0/3) | 0% (0/4) | 0% (0/10) | 0% (0/4) | 0% (0/4) |
| claude-sonnet-4-6 | 0% (0/3) | 0% (0/4) | 0% (0/11) | 0% (0/3) | 0% (0/4) |
| gpt-5 | 0% (0/3) | 25% (1/4) | 0% (0/11) | 0% (0/4) | 0% (0/4) |
| grok-4 | 0% (0/3) | 25% (1/4) | 9% (1/11) | 0% (0/4) | 0% (0/4) |
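A category cell like the ones above is just a grouped count of per-question results. A minimal sketch, assuming a hypothetical record format (the field names `model`, `category`, and `correct` are illustrative, not the repo's schema):

```python
from collections import defaultdict

# Hypothetical per-question results; the real benchmark has 28 per model.
results = [
    {"model": "grok-4", "category": "Member-Level", "correct": True},
    {"model": "grok-4", "category": "Member-Level", "correct": False},
    {"model": "grok-4", "category": "Aggregate", "correct": False},
]

# (model, category) -> [correct, total]
cells = defaultdict(lambda: [0, 0])
for r in results:
    cell = cells[(r["model"], r["category"])]
    cell[0] += r["correct"]  # bool counts as 0 or 1
    cell[1] += 1

for (model, category), (ok, total) in sorted(cells.items()):
    print(f"{model} · {category}: {ok}/{total} ({ok / total:.0%})")
```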
These are the questions where parametric knowledge fails universally — and where MCP tools earn their keep.
One MCP endpoint. 12 tools. Same data that powers this benchmark. Works with Claude Desktop, Cursor, Claude Code, and any MCP-compatible client.