One AI reviewing its own work is a student grading their own exam — confident, and sharing every blind spot that wrote the bug in the first place. The fix is almost stupid: ask other AIs, from other companies, that didn't write the code and don't share a brain. Here's why it works, the real scars from running it, and a tool you can run on your next git diff tonight.
I had a diff. I asked the same model that helped me write it to review it. It said — and I'm quoting the genre here — "Looks good. Tests pass. Ship it." It would have shipped a bug. Not a typo: a real logic hole that the tests didn't cover and the author-model couldn't see, because the author-model is exactly the wrong entity to catch its own mistake.
Then I sent the identical diff to three other models from three other labs. Within a minute, two of them — independently, having never spoken to each other — pointed at the same line. Three other AIs found the bug mine swore wasn't there.
When one model writes the code and the same model reviews it, you're asking one mind to find its own mistake. The confidence is real; the coverage isn't. Worse: even three instances of the same model share a lineage — same training data, same habits, same failure modes — so they tend to be wrong in the same direction. Three Claudes can high-five over a bug a Gemini would have caught in five seconds.
The mental model: a review is only worth something if the reviewer could have made a different mistake than the author. A coding-tuned model from a different company is the cheapest "different mistake" you can buy. You're not looking for a smarter reviewer — you're looking for an uncorrelated one.
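A back-of-the-envelope sketch of why "uncorrelated" beats "smarter". The 0.3 miss rate is a made-up illustrative number, the function names are hypothetical, and real reviewers sit somewhere between the two idealized extremes shown here.

```javascript
// Probability that EVERY reviewer misses a given bug, under two idealized
// assumptions. missRate is a hypothetical per-reviewer chance of overlooking it.

// Fully independent reviewers: the misses multiply.
function allMissIndependent(missRate, reviewers) {
  return Math.pow(missRate, reviewers);
}

// Fully correlated reviewers (same lineage, same blind spot):
// if one misses, they all miss, so extra seats add nothing.
function allMissCorrelated(missRate, _reviewers) {
  return missRate;
}

const missRate = 0.3; // illustrative only, not a measured value
console.log(allMissIndependent(missRate, 3).toFixed(3)); // 0.027
console.log(allMissCorrelated(missRate, 3).toFixed(3)); // 0.300
```

Under these toy assumptions, three independent seats turn a 30% chance of a shared miss into under 3%, while three copies of the same model leave it at 30%.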
The whole thing is four moves:
git diff, or a file. The thing under review.
That's the giveaway at the bottom: one file, three providers, a consensus line. The integrated version (the 4/5-way committee that adjudicates, votes, and gates a merge) stays in the factory. But the single runner is genuinely most of the value, and it's yours.
This isn't theory — I ran a 5-way over a batch of work yesterday. The comedy and the lessons both held up:
Gemini Flash, asked to review with a file-reading tool, fell into a loop and printed "Wait, let's check if there is a file…" roughly forty times before timing out with zero actual review. Its Pro sibling did the dignified version: read the files, then returned a completely empty response. The fix: Gemini can't be trusted on a long tool-loop — inline the whole diff and force a single-shot answer. It went from useless to useful in one config change.
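One way the "inline the whole diff, single-shot answer" fix could look. This is a minimal sketch: `buildSingleShotPrompt` and the prompt wording are hypothetical, not the configuration actually used, and disabling tools in the request itself is provider-specific and not shown.

```javascript
// Build one self-contained prompt: the full diff is inlined, and the
// instructions tell the model not to go looking for files or tools.
function buildSingleShotPrompt(diff) {
  if (typeof diff !== "string" || diff.trim() === "") {
    throw new Error("empty diff: nothing to review");
  }
  return [
    "You are reviewing a code change. The complete diff is included below.",
    "Do not ask for more files and do not call any tools.",
    "Answer once, listing concrete problems with file and line where you can.",
    "",
    "--- BEGIN DIFF ---",
    diff,
    "--- END DIFF ---",
  ].join("\n");
}
```

The point of the design is that the model has nothing left to fetch: everything it needs is already in the one message, so there is no loop to fall into.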
Codex (OpenAI) fails a lot — CLI drops, PATH issues, runtime hiccups. But when it actually runs, it's the one that reads the real code: it corrected my invented line numbers and named real functions a plan-only reviewer would've missed. Lesson: wire the high-value-low-reliability reviewer as a bonus seat, never a dependency.
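A sketch of what "bonus seat, never a dependency" can mean in code, assuming each reviewer is wrapped as an async function. The names `withTimeout`, `runPanel`, and the `bonus` flag are hypothetical; only `Promise.allSettled` and `Promise.race` are standard.

```javascript
// Reject if a reviewer takes longer than `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), ms);
  });
  // Clear the timer either way so a finished run does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// seats: [{ name, run: () => Promise<string>, bonus: boolean }]
async function runPanel(seats, ms = 60000) {
  // allSettled waits for every seat and never rejects because one seat failed.
  const settled = await Promise.allSettled(
    seats.map((seat) => withTimeout(Promise.resolve().then(seat.run), ms))
  );
  const reviews = [];
  const failedRequired = [];
  settled.forEach((result, i) => {
    const seat = seats[i];
    if (result.status === "fulfilled") {
      reviews.push({ name: seat.name, text: result.value });
    } else if (!seat.bonus) {
      failedRequired.push(seat.name);
    }
    // A failed bonus seat is dropped silently: its review is a gift, not a gate.
  });
  return { reviews, failedRequired };
}
```

With this shape, a flaky reviewer marked `bonus: true` contributes when it runs and costs nothing when it doesn't; only failures among required seats are reported back.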
GLM-5.2 (a code-tuned model on OpenRouter) out-reviewed the generalists at pennies a pass and took the standing 5th seat. Lesson: a purpose-built reviewer beats a bigger general model for this job — diversity of specialty, not just brand.
I front-run the panel with a cheap deterministic script (a "seat zero" no LLM can sweet-talk). It checked every required element was present — and passed a module whose interaction was stone dead, because a key element was nested in the wrong place. Counting that things exist is not the same as checking they're wired right. The three human-lineage reviewers caught it; my script didn't. Lesson: the deterministic seat and the model seats catch different classes of bug — you want both.
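The gap between "exists" and "wired right" is easy to show on a toy structure. The page tree and element ids below are invented for illustration and are not the module from the story; the sketch only contrasts a presence check with a nesting check.

```javascript
// A tiny hypothetical UI tree: the submit button exists, but outside the form.
const page = {
  id: "root",
  children: [
    { id: "signup-form", children: [{ id: "email-input", children: [] }] },
    { id: "submit-button", children: [] },
  ],
};

// Depth-first search for a node by id.
function find(node, id) {
  if (node.id === id) return node;
  for (const child of node.children) {
    const hit = find(child, id);
    if (hit) return hit;
  }
  return null;
}

// Presence check: every required id exists somewhere in the tree.
function allPresent(root, ids) {
  return ids.every((id) => find(root, id) !== null);
}

// Wiring check: the child id must sit somewhere inside the parent id.
function isInside(root, parentId, childId) {
  const parent = find(root, parentId);
  return parent !== null && parent.id !== childId && find(parent, childId) !== null;
}

console.log(allPresent(page, ["signup-form", "email-input", "submit-button"])); // true
console.log(isInside(page, "signup-form", "submit-button")); // false
```

The first check is the kind that passed; the second is the kind that would have failed. A deterministic seat can run both, but only if someone thought to write the second one.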
You don't have to write a line of this. Open Claude Code (or any agent) in an empty folder, paste the brief below, and it'll build the runner and walk you through the API keys. Same tool as the code section — one of you types it, the other talks it into being.
One file, no installs (Node 18+). It sends your diff to three model families in parallel, prints each review, and flags what two or more independently want fixed. Keys come from the environment; any provider you haven't set a key for is skipped, so it runs with two or three. Pipe a real diff straight in.
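Two pieces of that description, skipping providers without a key and flagging what two or more reviewers agree on, can be sketched as below. This is not the actual runner: the environment variable names come from the run command, but the `findings` shape (`file` and `line` per issue) and the exact-match rule for agreement are assumptions, and the network calls are left out entirely.

```javascript
// Env var names match the run command; everything else here is hypothetical.
const PROVIDERS = [
  { name: "anthropic", keyVar: "ANTHROPIC_API_KEY" },
  { name: "gemini", keyVar: "GEMINI_API_KEY" },
  { name: "openrouter", keyVar: "OPENROUTER_API_KEY" },
];

// Keep only the providers whose key is set in the environment.
function activeProviders(env) {
  return PROVIDERS.filter((p) => Boolean(env[p.keyVar]));
}

// reviews: [{ name, findings: [{ file, line }] }]
// Returns locations flagged by at least `minVotes` distinct reviewers.
function consensus(reviews, minVotes = 2) {
  const votes = new Map(); // "file:line" -> Set of reviewer names
  for (const review of reviews) {
    for (const f of review.findings) {
      const key = `${f.file}:${f.line}`;
      if (!votes.has(key)) votes.set(key, new Set());
      votes.get(key).add(review.name); // a Set, so one reviewer counts once
    }
  }
  return [...votes.entries()]
    .filter(([, names]) => names.size >= minVotes)
    .map(([location, names]) => ({ location, reviewers: [...names].sort() }));
}
```

In a real run, `activeProviders(process.env)` would decide which seats get the diff. Matching on exact `file:line` is the crudest possible agreement rule; reviewers that describe the same bug a few lines apart would need fuzzier matching.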
Keys: console.anthropic.com · aistudio.google.com · openrouter.ai/keys. Then:
git diff | ANTHROPIC_API_KEY=… GEMINI_API_KEY=… OPENROUTER_API_KEY=… node cross-review.mjs