Field GuideCross-Provider Review

The bug your model swears isn't there.

One AI reviewing its own work is a student grading their own exam — confident, and sharing every blind spot that wrote the bug in the first place. The fix is almost stupid: ask other AIs, from other companies, that didn't write the code and don't share a brain. Here's why it works, the real scars from running it, and a tool you can run on your next git diff tonight.

the scene why self-review fails the architecture real scars make it reliable build by talking the working code

01 · the scene

"Looks good. Tests pass. Ship it."

I had a diff. I asked the same model that helped me write it to review it. It said — and I'm quoting the genre here — "Looks good. Tests pass. Ship it." It would have shipped a bug. Not a typo: a real logic hole that the tests didn't cover and the author-model couldn't see, because the author-model is exactly the wrong entity to catch its own mistake.

Then I sent the identical diff to three other models from three other labs. Within a minute, two of them — independently, having never spoken to each other — pointed at the same line. Three other AIs found the bug mine swore wasn't there.

02 · why self-review fails

Same brain, same blind spot

When one model writes the code and the same model reviews it, you're asking one mind to find its own mistake. The confidence is real; the coverage isn't. Worse: even three instances of the same model share a lineage — same training data, same habits, same failure modes — so they tend to be wrong in the same direction. Three Claudes can high-five over a bug a Gemini would have caught in five seconds.

The mental model: a review is only worth something if the reviewer could have made a different mistake than the author. A coding-tuned model from a different company is the cheapest "different mistake" you can buy. You're not looking for a smarter reviewer — you're looking for an uncorrelated one.

03 · the architecture

A committee of robots that don't share a brain

The whole thing is four moves:

Take the diff — a git diff, or a file. The thing under review.
Fan it out — send the same skeptical-reviewer prompt to N models from different families, in parallel. (I run Anthropic + Google + a coding-tuned model via OpenRouter.)
Collect the verdicts — each comes back with findings + a one-line SHIP / FIX.
Read the consensus, not the loudest voice — the signal isn't any single review. It's what two different lineages independently flag. One model alone is an opinion; two unrelated models agreeing is a finding.

That's the giveaway at the bottom: one file, three providers, a consensus line. The integrated version — the 4/5-way committee that adjudicates, votes, and gates a merge — stays in the factory. But the single runner is genuinely most of the value, and it's yours.

04 · real scars

What actually happens when you run a panel of robots

This isn't theory — I ran a 5-way over a batch of work yesterday. The comedy and the lessons both held up:

scar · the model that lost its mind

Gemini Flash, asked to review with a file-reading tool, fell into a loop and printed "Wait, let's check if there is a file…" roughly forty times before timing out with zero actual review. Its Pro sibling did the dignified version: read the files, then returned a completely empty response. The fix: Gemini can't be trusted on a long tool-loop — inline the whole diff and force a single-shot answer. It went from useless to useful in one config change.

scar · the flaky genius

Codex (OpenAI) fails a lot — CLI drops, PATH issues, runtime hiccups. But when it actually runs, it's the one that reads the real code: it corrected my invented line numbers and named real functions a plan-only reviewer would've missed. Lesson: wire the high-value-low-reliability reviewer as a bonus seat, never a dependency.

win · the specialist earned its chair

GLM-5.2 (a code-tuned model on OpenRouter) out-reviewed the generalists at pennies a pass and took the standing 5th seat. Lesson: a purpose-built reviewer beats a bigger general model for this job — diversity of specialty, not just brand.

scar · presence is not structure

I front-run the panel with a cheap deterministic script (a "seat zero" no LLM can sweet-talk). It checked every required element was present — and passed a module whose interaction was stone dead, because a key element was nested in the wrong place. Counting that things exist is not the same as checking they're wired right. The three human-lineage reviewers caught it; my script didn't. Lesson: the deterministic seat and the model seats catch different classes of bug — you want both.

05 · make it reliable

How to keep the panel honest

Diversity beats redundancy. Three different companies > three of the same model. The point is uncorrelated mistakes.
Inline context for models that can't read files. Don't make a model hunt for the diff through a tool loop — paste it into the prompt. (This is the whole Gemini fix.)
One failing seat never sinks the run. Catch each reviewer's error and keep the others — a flaky genius shouldn't take down the panel.
Trust the overlap. Weight a finding by how many independent lineages raised it. Two unrelated models on the same line = stop and look.
Add a deterministic seat-zero for the mechanical stuff (does it parse? are the required pieces present and wired?) — it's free and it can't be charmed.

06 · build it by talking

Don't want to touch code? Describe it.

You don't have to write a line of this. Open Claude Code (or any agent) in an empty folder, paste the brief below, and it'll build the runner and walk you through the API keys. Same tool as the code section — one of you types it, the other talks it into being.

07 · the working code

Run it on your next diff

One file, no installs (Node 18+). It sends your diff to three model families in parallel, prints each review, and flags what two or more independently want fixed. Keys come from the environment; any provider you haven't set a key for is skipped, so it runs with two or three. Pipe a real diff straight in.

Keys: console.anthropic.com · aistudio.google.com · openrouter.ai/keys. Then:
git diff | ANTHROPIC_API_KEY=… GEMINI_API_KEY=… OPENROUTER_API_KEY=… node cross-review.mjs