Research project · 2026

Can AI Build a Game?

An independent benchmark of frontier AI coding ability. 580 scoreable builds across 9 models and 10 retro games. Public paper, library, and per-build judge data are all open.

580
Builds shipped
9
Frontier models
10
Retro games
397
V2 builds judged

Field notes

Problem
Public AI coding benchmarks are gameable, vibes-based reviews are noisy, and AI judges quietly inflate scores when nobody cross-validates. I wanted a fair, reproducible, hard-to-fake test of frontier AI coding ability.
What I owned
End-to-end design and execution. Spec authorship, a planner-vs-builder factorial, automated browser QA, a three-judge V2 scoring panel (Sonnet · Gemini Pro · GPT-5.4), data audit, the public research paper, and the interactive library.
Constraints
Zero human intervention during builds. No retries. Real browser launches via Playwright, not static lint checks. Cross-validation across three independent AI judges to surface scoring drift.
Headline finding
The premium tier is a statistical tie — Sonnet (8.62) and GPT-5.4 (8.58) are separated by 0.04 points. Visual fidelity is universally weak. Specs are model-dependent and can hurt. And the AI judges themselves disagree more than the leaderboard suggests.
Artifacts
Open paper, interactive library of all 580 playable builds, judge data viewer, planner-vs-builder side-by-side, and the underlying audited dataset — all at benchmark.oscraven.com. Source on GitHub.

§ 01 — Premise

Why this benchmark exists

AI coding ability is hard to evaluate honestly. Public benchmarks like HumanEval and SWE-bench are gameable — models train on the test sets and rankings stop reflecting real-world capability. Vibes-based "I tried the new model" reviews are noisy and dominated by influencer dynamics.

The deeper problem is judge calibration. When you ask one AI to score another AI's output, it tends to inflate. In V1, Gemini Flash inflated the consensus score by roughly 2.4 points across the board, a bias that was invisible until I cross-validated against two other judges. Flash wasn't taking sides: it inflated everyone, including its own competitors.
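That cross-check is simple to run once per-build scores exist. A minimal sketch, assuming the three judges scored the same builds; the judge labels echo the text, but the numbers and the 1.0-point flag threshold are invented for illustration:

```python
from statistics import mean

# Hypothetical per-build scores from three judges on the same four builds.
# Real per-judge data lives in the judge data viewer at benchmark.oscraven.com.
scores = {
    "sonnet":       [7.5, 6.0, 8.0, 4.5],
    "gemini-flash": [9.5, 9.0, 10.0, 8.5],   # the V1 inflater
    "gpt":          [7.0, 6.5, 8.5, 5.0],
}

def drift(judge: str) -> float:
    """Mean gap between one judge and the consensus of the other two."""
    others = [j for j in scores if j != judge]
    gaps = [s - mean(scores[o][i] for o in others)
            for i, s in enumerate(scores[judge])]
    return mean(gaps)

for judge in scores:
    d = drift(judge)
    print(f"{judge:14s} drift {d:+.2f}{'  <- inflating' if d > 1.0 else ''}")
```

A judge that sits far above the leave-one-out consensus of its peers is inflating everyone, which is exactly the pattern Flash showed in V1.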

I wanted to know: can today's frontier models actually build something that runs, that someone can use, with no help from a human? And how do you measure the answer in a way that's genuinely hard to fake?

§ 02 — Method

Four design choices that make the result hard to game

The design choices are the four constraints above: zero human intervention during builds; no retries, so every build is a single shot; real browser launches via Playwright rather than static lint checks; and cross-validation across three independent AI judges to surface scoring drift. Together they mean a build scores well only if it actually runs. The browser step is sketched below.
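A minimal sketch of the launch-and-watch idea using Playwright's Python API, assuming each build is a single self-contained HTML file. This illustrates the design choice; it is not the benchmark's actual harness:

```python
from playwright.sync_api import sync_playwright

def smoke_test(build_path: str, run_ms: int = 5000) -> list[str]:
    """Open a build in a real browser and collect runtime errors."""
    errors: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Uncaught exceptions and console errors both count as failures;
        # a static lint pass would miss every one of them.
        page.on("pageerror", lambda err: errors.append(str(err)))
        page.on("console",
                lambda msg: errors.append(msg.text) if msg.type == "error" else None)
        page.goto(f"file://{build_path}")
        page.wait_for_timeout(run_ms)  # give the game loop time to actually run
        browser.close()
    return errors
```

A build that lints clean but throws on its first animation frame fails here, which is the point of launching a real browser.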

§ 03 — Exhibits

Easy, medium, hard

One game from each band of the difficulty curve. Scores are the V2 three-judge consensus across all 40 builds per game (eight builders × five planning conditions). The full grid is in the QA data viewer.
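For concreteness, here is how a per-game number like those in the exhibits could be assembled, assuming an unweighted mean of the three judges per build; that is an assumption made for illustration (lesson 01 below mentions judge weights):

```python
from statistics import mean

# Hypothetical scores for one game's 40 builds (8 builders x 5 planning
# conditions); only two builds shown.
builds = [
    {"sonnet": 8.5, "gemini-pro": 8.0, "gpt-5.4": 9.0},
    {"sonnet": 6.0, "gemini-pro": 5.5, "gpt-5.4": 6.5},
    # ... 38 more builds in the real grid
]

per_build = [mean(b.values()) for b in builds]  # three-judge consensus per build
print(f"game score: {mean(per_build):.2f}")     # the number shown per game
```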

§ 04 — Lessons

Eight lessons from 580 builds

The full lessons (with live evidence from the dataset) are on the interactive dashboard. Headlines below in the order they were learned.

  1. Have a plan — but decide what to measure before you start.

    You can't tell improvement from noise without a fixed measurement rig. V1 and V2 used different judge panels, so their scores aren't comparable — the apparent "shifts" are measurement artefacts, not model changes. Lock your rubric, judges and weights before scoring anything.

  2. Context is methodology — isolated vs accumulated changes who "wins".

    V1 builders saw prior game code in the same session (accumulated). V2 used a fresh session per build (isolated). That single change flipped Opus from #3 (V1, 8.22) to #1 (V2, 7.77). Match the benchmark to the deployment: for one-shot generation, test isolated; for agent pipelines with memory, test accumulated. Don't mix the two. Both session modes are sketched in code after this list.

  3. Training data decides — models nail well-known tasks and miss novel ones.

    Game complexity matters less than representation in training data. Snake averages 8.72; Donkey Kong averages 6.09 — same models, same specs, a 2.63-point gap that no model choice closes. Before trusting one-shot AI code, ask how often this exact problem has appeared in training data.

  4. Specs can hurt — especially for smaller builders.

    V2's control condition (game name only, no spec) often outperformed detailed specs. Haiku built Snake at 7.93/10 with no spec, then collapsed to 1.43/10 when given an Opus-written plan — a 6.5-point drop on the same model and same game. Most builders score worse with specs; only GPT-5.4 meaningfully benefits (+0.74). Calibrate spec complexity to builder capability.

  5. The AI Iron Triangle — when speed and cost buy appeasement, not truth.

    Gemini Flash was the fastest, cheapest judge in V1. It rated everything 9–10. That wasn't quality — that was appeasement. Flash inflated every builder's consensus by roughly +2.4 points; it rated o3-mini (broken output) at 6.97, higher than other judges rated functional Haiku builds. Cheap and fast AI often means agreeable, not accurate. Don't trust single-AI evaluations for anything load-bearing.

  6. Taste is non-negotiable — AI cannot grade AI alone.

    Static AI judges read code; they see mechanics, not gameplay. Many V1/V2 builds rated 8+ by AI judges are actually unwinnable, have broken collision, or fail in ways a 30-second playtest would catch. Build a human review loop into any AI-evaluation pipeline. Use AI to triage and sort. Use humans for the final grade on anything that matters.

  7. It's the pair, not the model — agent pipelines live or die on planner × builder combos.

    The best builder on its own is not always the best builder in a pipeline. V2 shows strong planner × builder combinations and weak ones — same builder, different specs, different outcomes. When designing an agent system, test the planner and builder together. A "good" planner paired with a "good" builder can score worse than two mid-tier models with better chemistry.

  8. Know natural limits — one-shot only works for well-known tasks.

    For famous problems (Snake, Pong) current frontier models one-shot surprisingly well. For obscure ones (Donkey Kong with barrel AI) they struggle regardless of spec detail or cost tier. The failure mode is predictable. For novel problems, plan for refinement, test-driven iteration, or hybrid human-AI workflows — one-shot success rate is an artefact of training familiarity, not model capability.
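The isolated-vs-accumulated distinction from lesson 02, as a minimal sketch. `complete` stands in for any chat-completion call and is hypothetical, as are the message shapes; neither is the benchmark's actual prompt:

```python
from typing import Callable

Message = dict[str, str]

def build_isolated(complete: Callable[[list[Message]], str],
                   games: list[str]) -> dict[str, str]:
    """V2 style: every build starts from a fresh session."""
    return {g: complete([{"role": "user", "content": f"Build {g} as one HTML file."}])
            for g in games}

def build_accumulated(complete: Callable[[list[Message]], str],
                      games: list[str]) -> dict[str, str]:
    """V1 style: prior game code stays in the session's context."""
    history: list[Message] = []
    outputs: dict[str, str] = {}
    for g in games:
        history.append({"role": "user", "content": f"Build {g} as one HTML file."})
        code = complete(history)          # the model sees every earlier build
        history.append({"role": "assistant", "content": code})
        outputs[g] = code
    return outputs
```

Same builder, same games; the only difference is what the model can see, and per lesson 02 that difference alone is enough to reorder the leaderboard.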

§ 05 — Implications

What this means for product work

This started as a curiosity project, but the eight lessons map directly onto applied AI product work. They answer the same questions clients ask when shipping AI features, with receipts from 580 builds rather than vibes.

§ 06 — Open data

The full benchmark, open

The research paper, the playable library of 580 builds, the per-judge score data, and the side-by-side spec/plan viewer are all open at benchmark.oscraven.com. Source on GitHub.