Recent Scorecards
Latest model evaluations across coding, reasoning, and tool-use tasks.
Why Engineers Trust This
Built for teams choosing AI models for production workloads.
Real engineering tasks
Bug fixes, build-vs-buy decisions, docs-driven workflows. No synthetic toy problems.
Rubrics you can audit
Every eval includes the exact prompt, scoring criteria, and failure cases. Fully reproducible.
Actionable recommendations
Pick the right model for each job, backed by clear rationale. We track cost, latency, and reliability, not just accuracy.
How It Works
Transparent, reproducible benchmarks you can verify.
Real tasks
We run actual engineering prompts — bug fixes, tradeoffs, integrations.
Blind scoring
Two engineers independently score each output against a 10-point rubric; a sketch of how the two scores combine appears below.
Full transparency
Every prompt, rubric, and failure case is published.
Daily updates
New scorecards every morning at 6am ET.
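For readers who want the mechanics, here is a minimal sketch of how two blind scores could be combined. The function and field names, and the one-point disagreement threshold, are illustrative assumptions, not the published pipeline.

```typescript
// Hypothetical sketch of dual blind scoring (assumed names and a
// 1-point adjudication threshold; not the actual published pipeline).
interface BlindScore {
  scorer: string;      // anonymized scorer ID
  rubricScore: number; // 0-10 against the published rubric
}

function combineScores(a: BlindScore, b: BlindScore) {
  const finalScore = (a.rubricScore + b.rubricScore) / 2;
  // Flag for a third look when the scorers disagree by more than 1 point.
  const needsAdjudication = Math.abs(a.rubricScore - b.rubricScore) > 1;
  return { finalScore, needsAdjudication };
}

// Example: an 8 and a 6 average to 7 and trip the disagreement flag.
console.log(combineScores(
  { scorer: "eng-1", rubricScore: 8 },
  { scorer: "eng-2", rubricScore: 6 },
)); // { finalScore: 7, needsAdjudication: true }
```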
What We Track
| Category | Metrics |
|---|---|
| Performance | Task score, error rate, rubric compliance |
| Cost | Tokens used, estimated spend, cost per task |
| Latency | P50/P95 response time, time-to-first-token |
| Reliability | Failure cases, guardrail misses, audit notes |
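To make the table concrete, here is one way a per-model scorecard record could be shaped, along with a common nearest-rank convention for the P50/P95 figures. The interface, field names, and percentile estimator are assumptions inferred from the table above, not the site's published schema.

```typescript
// Hypothetical record shape inferred from the table above; field names
// are illustrative assumptions, not the published schema.
interface ScorecardEntry {
  model: string;             // e.g. "GPT-5.4 XHigh"
  // Performance
  taskScore: number;         // combined 0-10 rubric score
  errorRate: number;         // fraction of tasks that failed
  rubricCompliance: number;  // fraction of rubric items satisfied
  // Cost
  tokensUsed: number;
  estimatedSpendUsd: number;
  costPerTaskUsd: number;
  // Latency (milliseconds)
  p50LatencyMs: number;
  p95LatencyMs: number;
  timeToFirstTokenMs: number;
  // Reliability
  failureCases: string[];    // links to published failure cases
  guardrailMisses: number;
  auditNotes: string;
}

// Nearest-rank percentile over observed response times, one common
// convention for P50/P95 (the site may use a different estimator).
function percentileMs(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((x, y) => x - y);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}
```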
Latest Articles
Deep dives on model performance, comparisons, and engineering benchmarks.
Daily Model Eval Scorecard — 2026-04-11
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Grok 4.20, and Gemma 4.
Daily Model Eval Scorecard — 2026-04-10
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Muse Spark, and GLM-5.1.
Daily Model Eval Scorecard — 2026-04-07
Head-to-head results across coding, reasoning, and tool-use tasks. Today: Gemini 3.1 Pro Preview, GPT-5.4 XHigh, Gemma 4, and Qwen 3.5.
Provider Pages
Quick picks by provider if you already know whose API you want to use.
Start with today's winner
Go straight to the daily scorecard, then drill into task-level breakdowns for coding, reasoning, and tool use.
View scorecards →