Eval, harnessed · the diagrams

Section 01 · the difference

Vibes-eval vs harness.

same agent · two ways to grade it

Vibes-eval ships regressions. A harness catches them before merge.

Section 02 · the golden set, alive

The set grows from production.

hand-curated start · production promotes the rest

Start with 12 you wrote yourself. End with 200 production traces signed off by the eval owner.

Section 03 · failures as training signal

"Models don't fail. They fail confidently."

capture confident-wrong · feed it back as the next golden case

High-confidence failures are the highest-value training signal. Capture them or you lose them.

Section 04 · three scorer types

Exact match, LLM judge, custom rule.

three scorers run in parallel per case

Default to pairwise+swap for the judge. Use the rule for the things the world won't let you get wrong.

Section 04b · the harness rubric

Six dimensions. An independent judge.

vollko / harness · MCP + Claude Code plugin · the worker never judges its own work

Six dimensions, three thresholds. The judge is a different agent than the generator. Harness-design, baked into MCP.

Section 05 · the dual sensor

Deterministic and inferential. Same gate.

two streams · one verdict · never trust just one

Rule-based catches "wrong on its face." Judge catches "wrong on its meaning." Both, or ship blind.

Section 06 · the promotion gate

The PR blocks, or it merges.

CI pipeline · agents promoted only after green eval

Every PR runs the gate. Regression on any sub-suite blocks merge. The eval owner reviews the failure, not the engineer.

Section 07 · the hill climb

The eval set is what you optimize.

harness engineering is its own discipline · the model stays frozen

Each iteration spends harness tokens, never weights. If the curve flattens, the eval set is wrong - not the model.

Section 08 · drift in production

What's changed since launch?

refusal-rate · the canary signal of drift

Refusal-rate and output-length shift before semantics do. Watch them first.

Section 09 · production back to golden set

Tail-based sampling, edit-mining, growth.

drop the boring middle, keep what teaches

Sample where it teaches. The set grows from real failures, not synthetic ones.

Section 10 · tools 2026 · OSS-first

Where the OSS picks land.

programmatic ↑↓ UX-heavy · OSS ←→ paid SaaS

Quadrant 1 (OSS + CI-first) is where the eval harness lives. Anything else is renting.

Section 11 · ways to ship bad evals

Five anti-patterns.

VIBES PROMOTE

"feels better, ship"

TOO SMALL

noise > signal

SINGLE SCORER

one number lies

JUDGE BIAS

position, length

NO DRIFT

silently regresses

Section 12 · vollko OSS · this layer

The primitives.

harness ☆ flagship

quality-gated SOP execution · 6-dimension scoring · event-sourced

agent-attestation

signed receipts that feed outcome loop

agent-toolprint

DSSE+JCS+Ed25519 tool-call receipts

agent-scroll

byte-deterministic transcripts

agent-rerun

reproducibility seed bundles

explain-since

"what changed since T?" primitive

· · ·

Build the AI-native firm

Directory · pick a face

← back to the AI-native organization whitepaper

Eval, harnessed.