The LedgerVendor · Model · Connector evals

What we tried.
What kept working.

Almost none of vibe coding is writing code — it's making connections: model to tool, app to platform, your data to the thing that acts on it. So the evaluations are the product. Here's the honest ledger — winners, scars, and the ones we dropped on sight.

✓ keeps working ~ works, with scars ✗ gave up / rejected 🗓 weekly-eval 💡 micro-lesson

A Platforms & hosting

The deploy / infra layer — where the app actually lives.

DigitalOcean

Confusing but good — cheap, established. The console fights you once; after that it's the always-on workhorse (one droplet runs the bots via PM2). Best $/value in the stack.

✓ keeps working

Vercel

Great when it works — git push = a deploy. But it fails silently when your commit email ≠ your account email, and vercel env add sneaks a trailing newline into your secret. Two scars that cost hours.

~ scars

Supabase

Backend-in-a-box, generous free tier — but Row-Level Security is on you. Deny-all first, open deliberately. Forget it and your data is public.

~ scars

Cloudflare

Workers are rock-solid glue (the Sentry→Telegram forwarder runs here). Free tier carries a lot.

✓ keeps working

Render

The place for a Docker + headless-Chromium job (rendering a PDF report) when Vercel's runtime won't do it.

~ works

Netlify

Confusing at the start (same as DigitalOcean) — but worked well in the end. The onboarding fights you; the platform doesn't.

✓ keeps working

Flask

When all you need is a URL around a script — minutes, not a framework. The anti-over-engineering win.

✓ keeps working

Hostinger

Sucked — proprietary builder, no file export. You build your site and can't get it out. The escape hatch (an API migration to Wix) hit its own ceiling — visual editing needs the Wix Editor, not the API. Lock-in is the whole story.

✗ gave up

B Fighting the legacy behemoths

The big-vendor surfaces you have to wrestle. Ranked by pain.

Microsoft

The worst — and it's the accounts, not the code. A personal Microsoft account tangled with a William & Mary org account means I can never reliably reach Teams or shared files; a work-machine org conflict blocked even Claude-for-Excel from logging in; the Azure CLI just fails auth. And the menu is never the one you were told to click.

✗ the worst

Google

Bad — same multi-account trap. "MSFT or GOOG, I can never access the shared thing." Plus AI Studio key wiring and a flaky Gemini API (empty responses, flash loops). The documented menu path is never the real one.

~ rough

GitHublogin / credentialing

Very good. Login-as-credential just works — modern auth done right. Also clean for private deps (GitHub Packages). The behemoth that doesn't fight you.

✓ winner

Twilio

A compliance maze before you can send one SMS — Console compliance profile + a Persona ID + an approved number. Works, but budget days not hours.

~ scars

C Data, services & tools

The connectors and stores the apps lean on.

Playwright

The automation that just runs — page export, and an autonomous beta-user swarm clicking through the product. Reliable.

✓ keeps working

Airtable

A clean base-of-record when a spreadsheet-shaped store is the right call (FloodStream).

✓ keeps working

Notion (API + MCP)

The hub for tasks + specs — but it normalizes your markdown (re-fetch before you append) and direct REST beats the MCP for writes. Great to read, fiddly to write.

~ scars

Stripe / Lemon Squeezy

Lemon Squeezy shipped billing first (merchant-of-record removes the tax headache); Stripe metered for usage-based later.

~ works

Permissioning cross-cutting

The single most repeated "why won't this connect." Two faces: vendor auth (personal-vs-work accounts, org/tenant conflicts, OAuth scopes, RLS) and agent permissions (finding the mode between "ask me every step" and "dangerously skip everything").

~ the constant fight

D Models & providers

Tiered by job: best model for business-critical work, cost-controlled for consumer.

OpenRouter

The aggregator that works — one key, 25+ models. How the review seats and the droplet router reach everything without juggling accounts.

✓ keeps working

Anthropic / Claude

Best reasoning, the core build model — but weak connectors (see the matrix below). Strongest brain, clumsiest hands.

~ build-core

OpenAI / GPT (+ Codex)

Strong as an agent, not just a chat box — and the best connectors in the set (GitHub + Notion, see below).

✓ keeps working

GLM-5.2 (z.ai)

Won the 5th review seat — a code-tuned model that out-reviews the generalists at pennies a pass. Replaced Kimi.

✓ earned a seat

Cerebras · Qwen · DeepSeek

Fast / cheap perspectives in the review pipeline; cycled through the weekly eval.

🗓 in rotation

Google / Gemini

As a review seat: constant context trouble — it can't read files, so you must inline the whole diff, and it still reviews against stale/truncated context or returns empty / loops (watched both today). The reasoning-trace-on-timeout is a genuinely useful audit signal; cheap on flash. A perspective, not a workhorse.

~ flaky reviewer

Codex (OpenAI) as a review seat

Fails a lot operationally — CLI / PATH / runtime drops mean it doesn't always run. But when it does, it's the one that reads the actual code: it's corrected invented line numbers and named real functions a plan-only review would miss. High value, low reliability — wire it as a bonus seat, not a dependency.

~ flaky, reads real code

Kimi (Moonshot)

Retired from the factory — GLM-5.2 beat it for the review seat.

✗ retired

Fable 5

Pulled from the factory.

✗ pulled

Hermes (Nous)

Keeps landing on WATCH — and that's the filter working. Its one differentiator (no refusals / "uncensored," neutral alignment) is the opposite of what tax + compliance products need — they want MORE output-side friction, not less. The OpenRouter volume that flags it is roleplay/companion traffic, a market we're not in. And it loses every staffed seat: Haiku won the swarm bench, the review panel already spans 5 families, extraction is deterministic-first. Fails all four admission conditions. (Factory review, 2026-06-13.)

✗ WATCH

E Connector quality — the matrix that matters

"Most of this is making connections, so all of these things matter." Same model, very different hands.

Model / agent	→ GitHub	→ Notion	Overall
OpenAI	great	great	best-in-class connectors
Claude	weak	weak	sucks at connectors (strong everywhere else)
AntiGravity	fine	fine	no problem
Cursor	used	—	in the mix, but privacy-gated for client / tax code (vendor review first)

Codex and Gemini are judged as review seats in section D, not connectors — Codex reads real code but drops a lot; Gemini can't read files and chokes on context.

F The weekly factory eval

New models and tools get tested every week. Most don't make it — and why is the story.

The cadence

Four standing crons watch the frontier (models Sunday, practices + editors Tuesday). Nothing enters the factory on hype — only on a measured pass.

🗓 weekly

The bar

Default to the best model; a challenger earns a seat only by out-reviewing the incumbent on real diffs. Rejections (Hermes) are kept as receipts.

✓ the moat

G The little things that compound

Quality-of-life lessons nobody tells you — the "why didn't I know this" moments.

Open it, don't link it

Never hand back a path to something you saved. Open it in the browser so I can see it. A file I have to go chase is a failure, not a deliverable.

💡 lesson

Go get the screenshot

"I took a screenshot" = the agent goes and finds it, I never dig out a filepath. Newest file wins — desktop in the Screenshots folder, phone in the synced Dropbox folder; I just say which. (Only matters in terminal CC, where you can't drag-drop an image; on upload-capable clients you just paste it.)

💡 lesson

Drag-drop reality

You can't drag-and-drop a file into Claude Code in the terminal — but you can into AntiGravity or Codex. Know each tool's real input affordances.

💡 lesson

Output goes home

Deliverables land in one known output folder and open themselves. No scavenger hunts.

💡 lesson

Don't send me into a menu

"Nothing ever works as you describe, and the menus aren't worth going through — too convoluted." If a task is clickable, do it for me or script it — don't narrate a vendor menu path that won't match what's actually on my screen.

💡 lesson

one thing left (the rest came from your own records)

Real cost numbers — your actual monthly figures, to make the cost comparison land.
Optional: Cursor's GitHub/Notion connector quality, if you ever ran it for that.

Resolved from your own records (no asking): Microsoft & Google (multi-account/tenant hell + convoluted menus), Hostinger (proprietary builder, no export), permissioning (vendor auth + agent modes), the menu lesson — from session transcripts; Hermes WATCH rationale — from the Notion factory review; Codex/Gemini reviewer reliability — from the review ledgers.