Research project · 2026

Can AI Build a Game?

An independent benchmark of frontier AI coding ability. 580 scoreable builds across 9 models and 10 retro games. Public paper, library, and per-build judge data are all open.

580
Builds shipped
9
Frontier models
10
Retro games
397
V2 builds judged

Field notes

Problem
Public AI coding benchmarks are gameable, vibes-based reviews are noisy, and AI judges quietly inflate scores when nobody cross-validates. I wanted a fair, reproducible, hard-to-fake test of frontier AI coding ability.
What I owned
End-to-end design and execution. Spec authorship, a planner-vs-builder factorial, automated browser QA, a three-judge V2 scoring panel (Sonnet · Gemini Pro · GPT-5.4), data audit, the public research paper, and the interactive library.
Constraints
Zero human intervention during builds. No retries. Real browser launches via Playwright, not static lint checks. Cross-validation across three independent AI judges to surface scoring drift.
Headline finding
The premium tier is a statistical tie — Sonnet (8.62) and GPT-5.4 (8.58) are separated by 0.04 points. Visual fidelity is universally weak. Specs are model-dependent and can hurt. And the AI judges themselves disagree more than the leaderboard suggests.
Artifacts
Open paper, interactive library of all 580 playable builds, judge data viewer, planner-vs-builder side-by-side, and the underlying audited dataset — all at benchmark.oscraven.com. Source on GitHub.

§ 01 — Premise

Why this benchmark exists

AI coding ability is hard to evaluate honestly. Public benchmarks like HumanEval and SWE-bench are gameable — models train on the test sets and rankings stop reflecting real-world capability. Vibes-based "I tried the new model" reviews are noisy and dominated by influencer dynamics.

The deeper problem is judge calibration. When you ask one AI to score another AI's output, it tends to inflate. In V1, Gemini Flash inflated the consensus score by roughly 2.4 points across the board, a bias that was invisible until I cross-validated against two other judges. Flash wasn't taking sides: it inflated everyone, including its own competitors.
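That cross-check is simple to run once per-build scores exist. A minimal sketch, assuming the three judges scored the same builds; the judge labels echo the text, but the numbers and the 1.0-point flag threshold are invented for illustration:

```python
from statistics import mean

# Hypothetical per-build scores from three judges on the same four builds.
# Real per-judge data lives in the judge data viewer at benchmark.oscraven.com.
scores = {
    "sonnet":       [7.5, 6.0, 8.0, 4.5],
    "gemini-flash": [9.5, 9.0, 10.0, 8.5],   # the V1 inflater
    "gpt":          [7.0, 6.5, 8.5, 5.0],
}

def drift(judge: str) -> float:
    """Mean gap between one judge and the consensus of the other two."""
    others = [j for j in scores if j != judge]
    gaps = [s - mean(scores[o][i] for o in others)
            for i, s in enumerate(scores[judge])]
    return mean(gaps)

for judge in scores:
    d = drift(judge)
    print(f"{judge:14s} drift {d:+.2f}{'  <- inflating' if d > 1.0 else ''}")
```

A judge that sits far above the leave-one-out consensus of its peers is inflating everyone, which is exactly the pattern Flash showed in V1.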

I wanted to know: can today's frontier models actually build something that runs, that someone can use, with no help from a human? And how do you measure the answer in a way that's genuinely hard to fake?

§ 02 — Method

Four design choices that make the result hard to game

The design choices are the four constraints above: zero human intervention during builds; no retries, so every build is a single shot; real browser launches via Playwright rather than static lint checks; and cross-validation across three independent AI judges to surface scoring drift. Together they mean a build scores well only if it actually runs. The browser step is sketched below.
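A minimal sketch of the launch-and-watch idea using Playwright's Python API, assuming each build is a single self-contained HTML file. This illustrates the design choice; it is not the benchmark's actual harness:

```python
from playwright.sync_api import sync_playwright

def smoke_test(build_path: str, run_ms: int = 5000) -> list[str]:
    """Open a build in a real browser and collect runtime errors."""
    errors: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Uncaught exceptions and console errors both count as failures;
        # a static lint pass would miss every one of them.
        page.on("pageerror", lambda err: errors.append(str(err)))
        page.on("console",
                lambda msg: errors.append(msg.text) if msg.type == "error" else None)
        page.goto(f"file://{build_path}")
        page.wait_for_timeout(run_ms)  # give the game loop time to actually run
        browser.close()
    return errors
```

A build that lints clean but throws on its first animation frame fails here, which is the point of launching a real browser.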

§ 03 — Exhibits

Easy, medium, hard

One game from each band of the difficulty curve. Scores are the V2 three-judge consensus across all 40 builds per game (eight builders × five planning conditions). The full grid is in the QA data viewer.
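For concreteness, here is how a per-game number like those in the exhibits could be assembled, assuming an unweighted mean of the three judges per build; that is an assumption made for illustration (lesson 01 below mentions judge weights):

```python
from statistics import mean

# Hypothetical scores for one game's 40 builds (8 builders x 5 planning
# conditions); only two builds shown.
builds = [
    {"sonnet": 8.5, "gemini-pro": 8.0, "gpt-5.4": 9.0},
    {"sonnet": 6.0, "gemini-pro": 5.5, "gpt-5.4": 6.5},
    # ... 38 more builds in the real grid
]

per_build = [mean(b.values()) for b in builds]  # three-judge consensus per build
print(f"game score: {mean(per_build):.2f}")     # the number shown per game
```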

§ 04 — Lessons

Eight lessons from 580 builds

The full lessons (with live evidence from the dataset) are on the interactive dashboard. Headlines below in the order they were learned.

  1. Have a plan — but decide what to measure before you start.

    You can't tell improvement from noise without a fixed measurement rig. V1 and V2 used different judge panels, so their scores aren't comparable — the apparent "shifts" are measurement artefacts, not model changes. Lock your rubric, judges and weights before scoring anything.

  2. Context is methodology — isolated vs accumulated changes who "wins".

    V1 builders saw prior game code in the same session (accumulated). V2 used a fresh session per build (isolated). That single change flipped Opus from #3 (V1, 8.22) to #1 (V2, 7.77). Match the benchmark to the deployment: for one-shot generation, test isolated; for agent pipelines with memory, test accumulated. Don't mix the two. Both session modes are sketched in code after this list.

  3. Training data decides — models nail well-known tasks and miss novel ones.

    Game complexity matters less than representation in training data. Snake averages 8.72; Donkey Kong averages 6.09 — same models, same specs, a 2.63-point gap that no model choice closes. Before trusting one-shot AI code, ask how often this exact problem has appeared in training data.

  4. Specs can hurt — especially for smaller builders.

    V2's control condition (game name only, no spec) often outperformed detailed specs. Haiku built Snake at 7.93/10 with no spec, then collapsed to 1.43/10 when given an Opus-written plan — a 6.5-point drop on the same model and same game. Most builders score worse with specs; only GPT-5.4 meaningfully benefits (+0.74). Calibrate spec complexity to builder capability.

  5. The AI Iron Triangle — when speed and cost buy appeasement, not truth.

    Gemini Flash was the fastest, cheapest judge in V1. It rated everything 9–10. That wasn't quality — that was appeasement. Flash inflated every builder's consensus by roughly +2.4 points; it rated o3-mini (broken output) at 6.97, higher than other judges rated functional Haiku builds. Cheap and fast AI often means agreeable, not accurate. Don't trust single-AI evaluations for anything load-bearing.

  6. Taste is non-negotiable — AI cannot grade AI alone.

    Static AI judges read code; they see mechanics, not gameplay. Many V1/V2 builds rated 8+ by AI judges are actually unwinnable, have broken collision, or fail in ways a 30-second playtest would catch. Build a human review loop into any AI-evaluation pipeline. Use AI to triage and sort. Use humans for the final grade on anything that matters.

  7. It's the pair, not the model — agent pipelines live or die on planner × builder combos.

    The best builder on its own is not always the best builder in a pipeline. V2 shows strong planner × builder combinations and weak ones — same builder, different specs, different outcomes. When designing an agent system, test the planner and builder together. A "good" planner paired with a "good" builder can score worse than two mid-tier models with better chemistry.

  8. Know natural limits — one-shot only works for well-known tasks.

    For famous problems (Snake, Pong) current frontier models one-shot surprisingly well. For obscure ones (Donkey Kong with barrel AI) they struggle regardless of spec detail or cost tier. The failure mode is predictable. For novel problems, plan for refinement, test-driven iteration, or hybrid human-AI workflows — one-shot success rate is an artefact of training familiarity, not model capability.
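The isolated-vs-accumulated distinction from lesson 02, as a minimal sketch. `complete` stands in for any chat-completion call and is hypothetical, as are the message shapes; neither is the benchmark's actual prompt:

```python
from typing import Callable

Message = dict[str, str]

def build_isolated(complete: Callable[[list[Message]], str],
                   games: list[str]) -> dict[str, str]:
    """V2 style: every build starts from a fresh session."""
    return {g: complete([{"role": "user", "content": f"Build {g} as one HTML file."}])
            for g in games}

def build_accumulated(complete: Callable[[list[Message]], str],
                      games: list[str]) -> dict[str, str]:
    """V1 style: prior game code stays in the session's context."""
    history: list[Message] = []
    outputs: dict[str, str] = {}
    for g in games:
        history.append({"role": "user", "content": f"Build {g} as one HTML file."})
        code = complete(history)          # the model sees every earlier build
        history.append({"role": "assistant", "content": code})
        outputs[g] = code
    return outputs
```

Same builder, same games; the only difference is what the model can see, and per lesson 02 that difference alone is enough to reorder the leaderboard.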

§ 05 — Implications

What this means for product work

This started as a curiosity project, but the eight lessons map directly onto applied AI product work. They answer the same questions clients ask when shipping AI features, with receipts from 580 builds rather than vibes.

§ 06 — Open data

The full benchmark, open

The research paper, the playable library of 580 builds, the per-judge score data, and the side-by-side spec/plan viewer are all open at benchmark.oscraven.com. Source on GitHub.