Almost none of vibe coding is writing code — it's making connections: model to tool, app to platform, your data to the thing that acts on it. So the evaluations are the product. Here's the honest ledger — winners, scars, and the ones we dropped on sight.
✓ keeps working~ works, with scars✗ gave up / rejected🗓 weekly-eval💡 micro-lesson
A Platforms & hosting
The deploy / infra layer — where the app actually lives.
DigitalOcean
Confusing but good — cheap, established. The console fights you once; after that it's the always-on workhorse (one droplet runs the bots via PM2). Best $/value in the stack.
✓ keeps working
Vercel
Great when it works — git push = a deploy. But it fails silently when your commit email ≠ your account email, and vercel env add sneaks a trailing newline into your secret. Two scars that cost hours.
~ scars
Supabase
Backend-in-a-box, generous free tier — but Row-Level Security is on you. Deny-all first, open deliberately. Forget it and your data is public.
~ scars
Cloudflare
Workers are rock-solid glue (the Sentry→Telegram forwarder runs here). Free tier carries a lot.
✓ keeps working
Render
The place for a Docker + headless-Chromium job (rendering a PDF report) when Vercel's runtime won't do it.
~ works
Netlify
Confusing at the start (same as DigitalOcean) — but worked well in the end. The onboarding fights you; the platform doesn't.
✓ keeps working
Flask
When all you need is a URL around a script — minutes, not a framework. The anti-over-engineering win.
✓ keeps working
Hostinger
Sucked — proprietary builder, no file export. You build your site and can't get it out. The escape hatch (an API migration to Wix) hit its own ceiling — visual editing needs the Wix Editor, not the API. Lock-in is the whole story.
✗ gave up
B Fighting the legacy behemoths
The big-vendor surfaces you have to wrestle. Ranked by pain.
Microsoft
The worst — and it's the accounts, not the code. A personal Microsoft account tangled with a William & Mary org account means I can never reliably reach Teams or shared files; a work-machine org conflict blocked even Claude-for-Excel from logging in; the Azure CLI just fails auth. And the menu is never the one you were told to click.
✗ the worst
Google
Bad — same multi-account trap. "MSFT or GOOG, I can never access the shared thing." Plus AI Studio key wiring and a flaky Gemini API (empty responses, flash loops). The documented menu path is never the real one.
~ rough
GitHublogin / credentialing
Very good. Login-as-credential just works — modern auth done right. Also clean for private deps (GitHub Packages). The behemoth that doesn't fight you.
✓ winner
Twilio
A compliance maze before you can send one SMS — Console compliance profile + a Persona ID + an approved number. Works, but budget days not hours.
~ scars
C Data, services & tools
The connectors and stores the apps lean on.
Playwright
The automation that just runs — page export, and an autonomous beta-user swarm clicking through the product. Reliable.
✓ keeps working
Airtable
A clean base-of-record when a spreadsheet-shaped store is the right call (FloodStream).
✓ keeps working
Notion (API + MCP)
The hub for tasks + specs — but it normalizes your markdown (re-fetch before you append) and direct REST beats the MCP for writes. Great to read, fiddly to write.
~ scars
Stripe / Lemon Squeezy
Lemon Squeezy shipped billing first (merchant-of-record removes the tax headache); Stripe metered for usage-based later.
~ works
Permissioning cross-cutting
The single most repeated "why won't this connect." Two faces: vendor auth (personal-vs-work accounts, org/tenant conflicts, OAuth scopes, RLS) and agent permissions (finding the mode between "ask me every step" and "dangerously skip everything").
~ the constant fight
D Models & providers
Tiered by job: best model for business-critical work, cost-controlled for consumer.
OpenRouter
The aggregator that works — one key, 25+ models. How the review seats and the droplet router reach everything without juggling accounts.
✓ keeps working
Anthropic / Claude
Best reasoning, the core build model — but weak connectors (see the matrix below). Strongest brain, clumsiest hands.
~ build-core
OpenAI / GPT (+ Codex)
Strong as an agent, not just a chat box — and the best connectors in the set (GitHub + Notion, see below).
✓ keeps working
GLM-5.2 (z.ai)
Won the 5th review seat — a code-tuned model that out-reviews the generalists at pennies a pass. Replaced Kimi.
✓ earned a seat
Cerebras · Qwen · DeepSeek
Fast / cheap perspectives in the review pipeline; cycled through the weekly eval.
🗓 in rotation
Google / Gemini
As a review seat: constant context trouble — it can't read files, so you must inline the whole diff, and it still reviews against stale/truncated context or returns empty / loops (watched both today). The reasoning-trace-on-timeout is a genuinely useful audit signal; cheap on flash. A perspective, not a workhorse.
~ flaky reviewer
Codex (OpenAI) as a review seat
Fails a lot operationally — CLI / PATH / runtime drops mean it doesn't always run. But when it does, it's the one that reads the actual code: it's corrected invented line numbers and named real functions a plan-only review would miss. High value, low reliability — wire it as a bonus seat, not a dependency.
~ flaky, reads real code
Kimi (Moonshot)
Retired from the factory — GLM-5.2 beat it for the review seat.
✗ retired
Fable 5
Pulled from the factory.
✗ pulled
Hermes (Nous)
Keeps landing on WATCH — and that's the filter working. Its one differentiator (no refusals / "uncensored," neutral alignment) is the opposite of what tax + compliance products need — they want MORE output-side friction, not less. The OpenRouter volume that flags it is roleplay/companion traffic, a market we're not in. And it loses every staffed seat: Haiku won the swarm bench, the review panel already spans 5 families, extraction is deterministic-first. Fails all four admission conditions. (Factory review, 2026-06-13.)
✗ WATCH
E Connector quality — the matrix that matters
"Most of this is making connections, so all of these things matter." Same model, very different hands.
Model / agent
→ GitHub
→ Notion
Overall
OpenAI
great
great
best-in-class connectors
Claude
weak
weak
sucks at connectors (strong everywhere else)
AntiGravity
fine
fine
no problem
Cursor
used
—
in the mix, but privacy-gated for client / tax code (vendor review first)
Codex and Gemini are judged as review seats in section D, not connectors — Codex reads real code but drops a lot; Gemini can't read files and chokes on context.
F The weekly factory eval
New models and tools get tested every week. Most don't make it — and why is the story.
The cadence
Four standing crons watch the frontier (models Sunday, practices + editors Tuesday). Nothing enters the factory on hype — only on a measured pass.
🗓 weekly
The bar
Default to the best model; a challenger earns a seat only by out-reviewing the incumbent on real diffs. Rejections (Hermes) are kept as receipts.
✓ the moat
G The little things that compound
Quality-of-life lessons nobody tells you — the "why didn't I know this" moments.
Open it, don't link it
Never hand back a path to something you saved. Open it in the browser so I can see it. A file I have to go chase is a failure, not a deliverable.
💡 lesson
Go get the screenshot
"I took a screenshot" = the agent goes and finds it, I never dig out a filepath. Newest file wins — desktop in the Screenshots folder, phone in the synced Dropbox folder; I just say which. (Only matters in terminal CC, where you can't drag-drop an image; on upload-capable clients you just paste it.)
💡 lesson
Drag-drop reality
You can't drag-and-drop a file into Claude Code in the terminal — but you can into AntiGravity or Codex. Know each tool's real input affordances.
💡 lesson
Output goes home
Deliverables land in one known output folder and open themselves. No scavenger hunts.
💡 lesson
Don't send me into a menu
"Nothing ever works as you describe, and the menus aren't worth going through — too convoluted." If a task is clickable, do it for me or script it — don't narrate a vendor menu path that won't match what's actually on my screen.
💡 lesson
one thing left (the rest came from your own records)
Real cost numbers — your actual monthly figures, to make the cost comparison land.
Optional: Cursor's GitHub/Notion connector quality, if you ever ran it for that.
Resolved from your own records (no asking): Microsoft & Google (multi-account/tenant hell + convoluted menus), Hostinger (proprietary builder, no export), permissioning (vendor auth + agent modes), the menu lesson — from session transcripts; Hermes WATCH rationale — from the Notion factory review; Codex/Gemini reviewer reliability — from the review ledgers.