🏛️ Congress Trade Alerts

Last run: 2026-04-17 21:15 UTC · 4 models

How well do AI models know congressional trading?

Same 28 factual questions. Every major frontier model. Each model answers them twice: once from memory alone ("cold"), once with access to our MCP server as its only tool. Ground truth is computed live from our database at the moment we ask.

The best cold score was 8%; the same model jumped to 42% with our MCP server. (Best MCP score across all models: 50%.)

That gap is what an MCP server buys you: turning confident guesswork into live, verifiable data.

Methodology

Each model answers 28 questions twice: once with no tools (pure parametric knowledge), once with our MCP server wired up as its only tool. Ground truth is computed at run time by querying our D1 database directly, so the benchmark tests against current data, not a frozen snapshot.

Scoring: numeric answers count as correct on an exact match, or within ±5% for averages and percentages. List answers use set overlap (Jaccard index ≥ 0.8). Free-text answers are graded by Claude Haiku 4.5 as an LLM judge.

Questions are excluded from the denominator when the ground-truth query returns empty at run time (e.g. a not-yet-populated sector table), or when the model API fails after 3 retries, so a network blip or a missing data slice doesn't count against a model's accuracy. The scored denominator is shown next to each percentage. The full question set and scoring code are in the repo.
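For concreteness, here is a minimal sketch of those scoring rules. It is illustrative only, not the benchmark's actual code (which lives in the repo); the function names and answer shapes are assumptions, and the LLM-judge path for free-text answers is omitted.

```typescript
// Illustrative scoring sketch -- the real scorer lives in the repo.
// Function names and answer shapes here are assumptions, not the actual API.

// Numeric answers: exact match, or within ±5% relative tolerance for
// averages and percentages.
function scoreNumeric(answer: number, truth: number, isAverageOrPct: boolean): boolean {
  if (answer === truth) return true;
  if (!isAverageOrPct) return false;
  if (truth === 0) return answer === 0;
  return Math.abs(answer - truth) / Math.abs(truth) <= 0.05;
}

// List answers: Jaccard index (|intersection| / |union|) must be >= 0.8.
function scoreList(answer: string[], truth: string[]): boolean {
  const a = new Set(answer.map((s) => s.toLowerCase()));
  const b = new Set(truth.map((s) => s.toLowerCase()));
  const intersection = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union > 0 && intersection / union >= 0.8;
}

// Exclusion rule: a question drops out of the denominator when the
// ground-truth query comes back empty or the model API fails after 3 retries.
type Outcome = "correct" | "incorrect" | "excluded";
function outcome(groundTruthEmpty: boolean, apiFailed: boolean, correct: boolean): Outcome {
  if (groundTruthEmpty || apiFailed) return "excluded";
  return correct ? "correct" : "incorrect";
}
```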

Leaderboard

Sorted by cold-mode accuracy: how well each model does without tools. "With MCP" is the same model and the same questions, but with access to our MCP server.

#  Model              Cold       With MCP     Delta    Last run
1  grok-4             8% (2/26)  42% (10/24)  +34 pts  Apr 17, 2026 UTC
2  gpt-5              4% (1/26)  38% (10/26)  +35 pts  Apr 17, 2026 UTC
3  claude-opus-4-6    0% (0/25)  48% (12/25)  +48 pts  Apr 17, 2026 UTC
4  claude-sonnet-4-6  0% (0/25)  50% (13/26)  +50 pts  Apr 17, 2026 UTC
Footnote: claude-opus-4-7 · 7.7% (2/26) cold · Apr 18, 2026 UTC · Manual chat run (not comparable to automated rows)
Opus 4.7 cold was hand-fed via claude.ai on 2026-04-18 (UTC): a manual chat run, not a programmatic benchmark. 2 of 26 in-scope questions scored correct (Q7 committees and Q23 party return). 2 questions were skipped (Q6 and Q20, empty ground truth). Strict-substring matching was used for string questions because the Haiku LLM judge was unavailable (the same Anthropic billing lapse that prevented an automated Opus 4.7 run); the actual score may be slightly higher under LLM judging. No MCP run, since the manual-chat format doesn't support tool use.

C/D MCP tool fixes deployed 2026-04-18 (UTC) (commit 203edcf, Worker version e0d54de0). Effect not yet measured — leaderboard above predates this deploy.

Category breakdown (cold-mode accuracy)

Where models fall down without tools. Each cell shows correct/total for that category.

Model              Aggregate  Chamber & Party  Member-Level  Ticker-Level  Committee-Level
claude-opus-4-6    0% (0/3)   0% (0/4)         0% (0/10)     0% (0/4)      0% (0/4)
claude-sonnet-4-6  0% (0/3)   0% (0/4)         0% (0/11)     0% (0/3)      0% (0/4)
gpt-5              0% (0/3)   25% (1/4)        0% (0/11)     0% (0/4)      0% (0/4)
grok-4             0% (0/3)   25% (1/4)        9% (1/11)     0% (0/4)      0% (0/4)
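Each cell is simply correct/total within a category, with excluded questions left out of the denominator. A minimal rollup sketch, assuming each scored question carries a category label (the type and field names below are hypothetical, not the benchmark's actual schema):

```typescript
// Hypothetical per-question result shape; the real benchmark's types may differ.
interface QuestionResult {
  category: string;          // e.g. "Chamber & Party"
  outcome: "correct" | "incorrect" | "excluded";
}

// Roll per-question results up into "X% (correct/total)" cells per category.
// Excluded questions stay out of the denominator, matching the methodology.
function categoryBreakdown(results: QuestionResult[]): Record<string, string> {
  const tally = new Map<string, { correct: number; total: number }>();
  for (const r of results) {
    if (r.outcome === "excluded") continue;
    const t = tally.get(r.category) ?? { correct: 0, total: 0 };
    t.total += 1;
    if (r.outcome === "correct") t.correct += 1;
    tally.set(r.category, t);
  }
  const cells: Record<string, string> = {};
  for (const [category, t] of tally) {
    const pct = Math.round((100 * t.correct) / t.total);
    cells[category] = `${pct}% (${t.correct}/${t.total})`;
  }
  return cells;
}
```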

Questions every AI got wrong (cold mode)

These are the questions where parametric knowledge fails universally — and where MCP tools earn their keep.

Connect your AI to our data

One MCP endpoint. 12 tools. Same data that powers this benchmark. Works with Claude Desktop, Cursor, Claude Code, and any MCP-compatible client.
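If you'd rather script against the endpoint than use a desktop client, here is a minimal sketch using the official TypeScript MCP SDK. It assumes the server speaks Streamable HTTP; the endpoint URL, tool name, and arguments below are placeholders, so take the real values from the MCP setup page.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// Placeholder URL -- substitute the endpoint from the MCP setup page.
const transport = new StreamableHTTPClientTransport(
  new URL("https://example.com/mcp"),
);

const client = new Client({ name: "congress-trades-demo", version: "1.0.0" });
await client.connect(transport);

// List the server's tools (this server exposes 12).
const { tools } = await client.listTools();
console.log(tools.map((t) => t.name));

// Call one of them -- "exampleTool" and its arguments are hypothetical.
const result = await client.callTool({
  name: "exampleTool",
  arguments: { ticker: "NVDA" },
});
console.log(result);

await client.close();
```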

See MCP setup →