Section 01 · the difference
Vibes-eval vs harness.
same agent · two ways to grade it
Vibes-eval ships regressions. A harness catches them before merge.
Section 02 · the golden set, alive
The set grows from production.
hand-curated start · production promotes the rest
Start with 12 cases you wrote yourself. End with 200 production traces signed off by the eval owner.
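A minimal sketch of the promotion step, assuming a trace carries a corrected output and a sign-off field. `GoldenCase`, `Trace`, and `promote` are hypothetical names for illustration, not part of any named tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str
    source: str  # "hand" or "production"


@dataclass(frozen=True)
class Trace:
    prompt: str
    corrected_output: str
    signed_off_by: Optional[str] = None  # None until someone reviews it


def promote(trace: Trace, golden: List[GoldenCase], eval_owner: str) -> bool:
    """Append the trace as a golden case only if the eval owner signed it off."""
    if trace.signed_off_by != eval_owner:
        return False
    if any(case.prompt == trace.prompt for case in golden):
        return False  # already covered; avoid duplicate cases
    golden.append(GoldenCase(trace.prompt, trace.corrected_output, "production"))
    return True
```

The sign-off check is the point of the sketch: an unreviewed production trace never enters the set, no matter how interesting it looks.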
Section 03 · failures as training signal
"Models don't fail. They fail confidently."
capture confident-wrong · feed it back as the next golden case
High-confidence failures are the highest-value training signal. Capture them or you lose them.
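One way to capture confident-wrong outputs, sketched under the assumption that each graded outcome carries a confidence score in [0, 1] (self-reported or derived from log-probabilities; the source is not specified here). `Outcome`, `confident_failures`, and the 0.9 cutoff are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outcome:
    prompt: str
    output: str
    expected: str
    confidence: float  # assumed to be in [0, 1]


def confident_failures(outcomes: List[Outcome], min_confidence: float = 0.9) -> List[Outcome]:
    """Return wrong answers given with high confidence, most confident first."""
    wrong = [
        o for o in outcomes
        if o.output != o.expected and o.confidence >= min_confidence
    ]
    return sorted(wrong, key=lambda o: o.confidence, reverse=True)
```

Each returned outcome is a candidate for the next golden case, pending review.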
Section 04 · three scorer types
Exact match, LLM judge, custom rule.
three scorers run in parallel per case
Default to pairwise+swap for the judge. Use the rule for the things the world won't let you get wrong.
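Pairwise+swap can be sketched as follows: ask the judge twice with the candidates in opposite order, and accept a winner only when both orderings agree. The `Judge` callable is a stand-in for whatever LLM call is used; its signature is an assumption.

```python
from typing import Callable

# Assumed judge signature: (prompt, first, second) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_swap(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie"; a verdict that flips with position counts as a tie."""
    forward = judge(prompt, a, b)   # a shown first
    backward = judge(prompt, b, a)  # b shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"
```

A judge that always prefers whichever answer it sees first yields "tie" for every pair, so position bias shows up as lost signal instead of a false win.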
Section 04b · the harness rubric
Six dimensions. An independent judge.
vollko / harness · MCP + Claude Code plugin · the worker never judges its own work
Six dimensions, three thresholds. The judge is a different agent than the generator. Harness-design, baked into MCP.
Section 05 · the dual sensor
Deterministic and inferential. Same gate.
two streams · one verdict · never trust just one
Rule-based catches "wrong on its face." Judge catches "wrong on its meaning." Both, or ship blind.
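A rough sketch of the two streams feeding one verdict. The deterministic check here (valid JSON object with an `answer` key) and the 0.7 judge threshold are assumptions chosen for illustration; the judge score is taken as already computed elsewhere.

```python
import json


def rule_check(output: str) -> bool:
    """Deterministic sensor: the output must be a JSON object with an "answer" key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "answer" in data


def gate(output: str, judge_score: float, judge_threshold: float = 0.7) -> bool:
    """One verdict: pass only if the rule passes and the judge score clears the bar."""
    return rule_check(output) and judge_score >= judge_threshold
```

A high judge score cannot rescue malformed output, and well-formed output cannot rescue a low judge score.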
Section 06 · the promotion gate
The PR blocks, or it merges.
CI pipeline · agents promoted only after green eval
Every PR runs the gate. A regression on any sub-suite blocks the merge. The eval owner, not the engineer who opened the PR, reviews the failure.
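The gate logic might look like this in a CI step, assuming per-suite pass rates are available as plain dictionaries. The suite names in the usage note and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """List sub-suites where the candidate scores below baseline or is missing."""
    failed = []
    for suite, base_rate in baseline.items():
        cand_rate = candidate.get(suite)
        if cand_rate is None or cand_rate < base_rate - tolerance:
            failed.append(suite)
    return sorted(failed)


def exit_code(baseline: Dict[str, float], candidate: Dict[str, float]) -> int:
    """Non-zero exit blocks the merge in most CI systems."""
    return 1 if regressions(baseline, candidate) else 0
```

With a baseline of `{"routing": 0.92, "refusals": 0.88}` and a candidate of `{"routing": 0.95, "refusals": 0.80}`, the gate blocks on `refusals` even though `routing` improved.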
Section 07 · the hill climb
The eval set is what you optimize.
harness engineering is its own discipline · the model stays frozen
Each iteration spends harness tokens, never weights. If the curve flattens, the eval set is wrong, not the model.
Section 08 · drift in production
What's changed since launch?
refusal-rate · the canary signal of drift
Refusal rate and output length shift before semantics do. Watch them first.
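A minimal sketch of the two canaries, assuming refusals can be spotted with a small phrase list and length is measured in whitespace-separated words. The marker phrases and both alert thresholds are illustrative, not tuned values.

```python
from typing import Dict, List

# Hypothetical refusal markers; a real list would be tuned to the agent.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


def canaries(outputs: List[str]) -> Dict[str, float]:
    """Compute refusal rate and mean word count for a non-empty batch of outputs."""
    refusals = sum(any(m in o.lower() for m in REFUSAL_MARKERS) for o in outputs)
    words = sum(len(o.split()) for o in outputs)
    return {"refusal_rate": refusals / len(outputs), "mean_length": words / len(outputs)}


def drift_alerts(baseline: Dict[str, float], current: Dict[str, float],
                 max_refusal_delta: float = 0.05,
                 max_length_change: float = 0.25) -> List[str]:
    """Name the canaries that moved more than the allowed amount since baseline."""
    alerts = []
    if abs(current["refusal_rate"] - baseline["refusal_rate"]) > max_refusal_delta:
        alerts.append("refusal_rate")
    base_len = max(baseline["mean_length"], 1e-9)  # guard against division by zero
    if abs(current["mean_length"] - baseline["mean_length"]) / base_len > max_length_change:
        alerts.append("mean_length")
    return alerts
```

Both signals are cheap to compute on every batch, which is why they work as a first alarm before any semantic evaluation runs.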
Section 09 · production back to golden set
Tail-based sampling, edit-mining, growth.
drop the boring middle, keep what teaches
Sample where it teaches. The set grows from real failures, not synthetic ones.
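Tail-based sampling could be sketched like this: keep every trace that errored, was edited by a user, or ran slow, and keep only a small deterministic fraction of the rest. The trace fields (`error`, `user_edited`, `latency_ms`) and the thresholds are assumptions.

```python
import hashlib


def keep(trace: dict, middle_rate: float = 0.01, slow_ms: int = 10_000) -> bool:
    """Decide after the trace completes whether it is worth retaining."""
    # The tail: failures, user edits, and slow runs are always kept.
    if trace.get("error") or trace.get("user_edited") or trace.get("latency_ms", 0) > slow_ms:
        return True
    # The boring middle: keep a stable fraction, chosen by hashing the trace id
    # so the same trace always gets the same decision.
    digest = hashlib.sha256(trace["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(middle_rate * 2**64)
```

Hashing the id instead of calling a random generator keeps the decision reproducible across reruns and across services that see the same trace.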
Section 10 · tools 2026 · OSS-first
Where the OSS picks land.
programmatic ↑↓ UX-heavy · OSS ←→ paid SaaS
Quadrant 1 (OSS + CI-first) is where the eval harness lives. Anything else is renting.
Section 11 · ways to ship bad evals
Five anti-patterns.
VIBES PROMOTE: "feels better, ship"
TOO SMALL: noise > signal
SINGLE SCORER: one number lies
JUDGE BIAS: position, length
NO DRIFT: silently regresses
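To put a number on "noise > signal", here is a back-of-envelope margin of error for a pass rate, using the normal approximation to the binomial. The approximation is rough at very small sample sizes, so treat the result as an order-of-magnitude guide.

```python
import math


def pass_rate_margin(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error (z = 1.96) for an observed pass rate."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)
```

At a 75% pass rate, 12 cases give a margin of about ±0.245, so a 10-point change is indistinguishable from noise. At 200 cases the margin drops to about ±0.06.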
Section 12 · vollko OSS · this layer
The primitives.
harness ☆ flagship: quality-gated SOP execution · 6-dimension scoring · event-sourced
agent-attestation: signed receipts that feed the outcome loop
agent-toolprint: DSSE+JCS+Ed25519 tool-call receipts
agent-scroll: byte-deterministic transcripts
agent-rerun: reproducibility seed bundles
explain-since: "what changed since T?" primitive
· · ·
Build the AI-native firm