AI-native · substrate

Eval, harnessed.

Golden sets, scorers, gates, drift - the substrate that lets you change an agent without breaking what worked yesterday. Drawn, not described.

[Figure: GOLDEN SET (test #001, #002, ... 200 cases) → AGENT v7 under test → SCORER · exact 0.94 · SCORER · judge 0.87 · SCORER · rule PASS → GATE: PASS · avg 0.91 · promote]
Section 01 · the difference

Vibes-eval vs harness.

same agent · two ways to grade it
[Figure: the same agent graded two ways]
VIBES-EVAL: "try one query" · "looks good to me" · "try another" · "yeah, ship it" · N = 4 · not auditable · regresses silently in production
HARNESS: N = 200 · auditable · regression-blocking · aggregate 0.87 · gate PASS
case · exact · judge · rule · overall
#001 · .94 · .91 · ok · .92
#002 · .88 · .79 · ok · .84
#003 · .90 · .41 · ok · .65
#004 · .96 · .93 · ok · .94
...
#200 · .85 · .82 · ok · .83
Vibes-eval ships regressions. A harness catches them before merge.
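
A minimal sketch of what the harness column computes, assuming the overall score is the mean of the exact and judge scores and that a failed rule zeroes the case. The 0.80 threshold and the names `overall` and `aggregate_gate` are illustrative assumptions, not a vollko API; the four rows are the ones legible in the figure.

```python
from statistics import mean

# Rows legible in the figure; the remaining cases are omitted here.
CASES = {
    "#001": {"exact": 0.94, "judge": 0.91, "rule": True},
    "#002": {"exact": 0.88, "judge": 0.79, "rule": True},
    "#003": {"exact": 0.90, "judge": 0.41, "rule": True},
    "#004": {"exact": 0.96, "judge": 0.93, "rule": True},
}


def overall(case):
    """Mean of the numeric scorers; a failed rule zeroes the case (assumption)."""
    if not case["rule"]:
        return 0.0
    return mean([case["exact"], case["judge"]])


def aggregate_gate(cases, threshold=0.80):
    """Return (passed, aggregate, worst_case_id) for a dict of scored cases."""
    per_case = {case_id: overall(case) for case_id, case in cases.items()}
    aggregate = mean(per_case.values())
    worst = min(per_case, key=per_case.get)  # the case a reviewer should open first
    return aggregate >= threshold, aggregate, worst


passed, aggregate, worst = aggregate_gate(CASES)
print(f"aggregate={aggregate:.2f} passed={passed} worst={worst}")
```

Returning the worst case alongside the aggregate keeps a row like #003 (judge .41) visible even when the average clears the gate.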
Section 02 · the golden set, alive

The set grows from production.

hand-curated start · production promotes the rest
[Figure: golden-set size (CASES) by week, WK 1 to WK 20: 12, 28, 52, 78, 103, 127, 148, 164, 178, 203. Annotations: hand-curated · promotion from prod begins · covers ~90% of intent classes]
Start with 12 cases you wrote yourself. End with about 200, most of them production traces signed off by the eval owner.
Section 03 · failures as training signal

"Models don't fail. They fail confidently."

capture confident-wrong · feed it back as the next golden case
[Figure: AGENT RUN (confidence: 0.94) → EVAL VERDICT (WRONG) → FAILURE NOTE (input, expected, got, confidence, why_wrong) → GOLDEN SET +1 (regression test for next run) → PROMPT REVISION (the agent rewrites itself) → same agent, next iteration]
High-confidence failures are the highest-value training signal. Capture them or you lose them.
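
One way to sketch the capture step, using the failure-note fields from the figure (input, expected, got, confidence, why_wrong). The run format, the 0.9 confidence floor, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureNote:
    input: str
    expected: str
    got: str
    confidence: float
    why_wrong: str = ""  # left blank here; a reviewer fills it in


def confident_failures(runs, min_confidence=0.9):
    """Keep runs the eval marked WRONG while the agent was confident."""
    notes = []
    for run in runs:
        if run["verdict"] == "WRONG" and run["confidence"] >= min_confidence:
            notes.append(
                FailureNote(run["input"], run["expected"], run["got"], run["confidence"])
            )
    return notes


def to_golden_case(note):
    """A failure note becomes a regression test: same input, expected output."""
    return {"input": note.input, "expected": note.expected}


# Hypothetical runs, for illustration only.
runs = [
    {"input": "renew plan", "expected": "renew · 1 year", "got": "cancel",
     "confidence": 0.94, "verdict": "WRONG"},
    {"input": "refund order", "expected": "refund", "got": "escalate",
     "confidence": 0.40, "verdict": "WRONG"},
    {"input": "check status", "expected": "status", "got": "status",
     "confidence": 0.97, "verdict": "RIGHT"},
]
notes = confident_failures(runs)
new_cases = [to_golden_case(note) for note in notes]
```

Only the first run survives the filter: it is both wrong and confident, which is the combination the caption singles out.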
Section 04 · three scorer types

Exact match, LLM judge, custom rule.

three scorers run in parallel per case
[Figure: three scorers]
01 · EXACT MATCH · structured outputs. EXPECTED (what we want the agent to say): renew · $12,500 · 1 year. ACTUAL (what the agent said): renew · $12,500 · 1 year. Result: 1.00 · MATCH
02 · LLM JUDGE · open-ended outputs. RUBRIC: faithful? 0-1 · concise? 0-1 · cites sources? 0-1 · on-policy? 0-1. Plus POSITION SWAP: A > B 0.65 · B > A (swap) 0.62. Result: 0.87 · PASS
03 · CUSTOM RULE · domain assertions. THE RULES: amount must be positive (no negative or zero numbers) · currency must be allowed (only EUR / USD / RON for this account) · term within limits (1 month minimum · 5 years maximum). Result: PASS
Default to pairwise+swap for the judge. Use the rule for the things the world won't let you get wrong.
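
A hedged sketch of pairwise judging with a position swap. The `judge` callable is a stand-in for an LLM call and is assumed to return the probability that the first-listed answer is better. Averaging the two orderings is one common way to cancel position bias; the page does not say how the harness combines them, so treat that as an assumption.

```python
def pairwise_with_swap(judge, prompt, a, b):
    """Judge A vs B in both orders.

    Returns (pref_a, position_gap):
      pref_a        preference for A averaged over both orderings
      position_gap  disagreement between the orderings (0.0 means no position bias)
    """
    a_first = judge(prompt, a, b)  # P(A better) with A in the first slot
    b_first = judge(prompt, b, a)  # P(B better) with B in the first slot
    pref_a_swapped = 1.0 - b_first
    pref_a = (a_first + pref_a_swapped) / 2
    position_gap = abs(a_first - pref_a_swapped)
    return pref_a, position_gap


def first_slot_judge(prompt, first, second):
    """Stub with pure position bias: always favours whatever is listed first."""
    return 0.7


def content_judge(prompt, first, second):
    """Stub with no position bias: prefers answer "A" wherever it appears."""
    return 0.8 if first == "A" else 0.2


biased = pairwise_with_swap(first_slot_judge, "q", "A", "B")
fair = pairwise_with_swap(content_judge, "q", "A", "B")
```

With the biased stub the averaged preference collapses to a coin flip and the gap is large, which is the signal that a single-order verdict would have been an artifact of position.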
Section 04b · the harness rubric

Six dimensions. An independent judge.

vollko / harness · MCP + Claude Code plugin · the worker never judges its own work
[Figure: the harness rubric]
6-DIM RUBRIC · weighted: complete 25% · specific 20% · correct 20% · actionable 15% · coherent 10% · format 10%
PROFILE · THRESHOLD: lenient 3.0 · default 3.5 · strict 4.0 (production gate)
EVALUATOR BACKENDS: subagent (default, no API key) · anthropic API (Haiku) · OpenAI-compatible (vLLM / Ollama)
AI SLOP DETECTION: 15+ flagged patterns · 3+ hits auto-penalize specificity
SOP TEMPLATES: feature-dev 4×12 · investigation 3×7 · code-review 3×7 · _TEMPLATE
Six dimensions, three thresholds. The judge is a different agent than the generator. Harness-design, baked into MCP.
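
The weights and thresholds above can be combined as a weighted mean. This is a sketch, not the harness's actual implementation: the per-dimension scale is not stated on the page (a 1-5 scale is assumed because the thresholds sit between 3.0 and 4.0), and the function names are illustrative.

```python
# Weights and profile thresholds as listed in the figure.
WEIGHTS = {
    "complete": 0.25,
    "specific": 0.20,
    "correct": 0.20,
    "actionable": 0.15,
    "coherent": 0.10,
    "format": 0.10,
}
PROFILES = {"lenient": 3.0, "default": 3.5, "strict": 4.0}


def weighted_score(scores):
    """Weighted mean of the six dimension scores (assumed 1-5 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def passes(scores, profile="default"):
    """True if the weighted score meets the chosen profile's threshold."""
    return weighted_score(scores) >= PROFILES[profile]


# Hypothetical judge output for one artifact.
scores = {"complete": 4, "specific": 3, "correct": 5,
          "actionable": 4, "coherent": 4, "format": 3}
total = weighted_score(scores)
```

The same artifact can clear the default profile and still fail the strict production gate, which is the point of having three thresholds.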
Section 05 · the dual sensor

Deterministic and inferential. Same gate.

two streams · one verdict · never trust just one
[Figure: the dual sensor]
DETERMINISTIC (fast · certain · cheap): ✓ linter / type-check · ✓ schema match · JSON valid · ✓ unit tests · regex · numeric range
INFERENTIAL (slow · probabilistic · expensive): ≈ LLM-as-judge · pairwise · ≈ semantic similarity · ≈ rubric-graded helpfulness
GATE: AND-merge · both must pass · det wins ties
PROMOTE: ship to next stage
BLOCK: capture failure · grow golden set
Rule-based catches "wrong on its face." Judge catches "wrong on its meaning." Both, or ship blind.
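
A minimal sketch of the AND-merge, using the custom rules from Section 04 as the deterministic stream and a judge score as the inferential one. The offer fields, the 0.7 judge threshold, and the function names are assumptions; the "det wins ties" tie-break is not modelled because the page does not define it.

```python
ALLOWED_CURRENCIES = {"EUR", "USD", "RON"}


def deterministic_failures(offer):
    """Return the list of rule violations; empty means the rules pass."""
    failures = []
    if offer["amount"] <= 0:
        failures.append("amount must be positive")
    if offer["currency"] not in ALLOWED_CURRENCIES:
        failures.append("currency not allowed")
    if not 1 <= offer["term_months"] <= 60:  # 1 month minimum, 5 years maximum
        failures.append("term out of limits")
    return failures


def dual_gate(offer, judge_score, judge_threshold=0.7):
    """AND-merge: PROMOTE only if both streams pass, otherwise BLOCK with reasons."""
    reasons = deterministic_failures(offer)
    if judge_score < judge_threshold:
        reasons.append(f"judge score {judge_score:.2f} below {judge_threshold:.2f}")
    return ("PROMOTE" if not reasons else "BLOCK"), reasons


good = {"amount": 12500, "currency": "USD", "term_months": 12}
bad = {"amount": -5, "currency": "USD", "term_months": 12}

print(dual_gate(good, 0.87))
print(dual_gate(bad, 0.95))   # judge is happy, rules are not
print(dual_gate(good, 0.41))  # rules are happy, judge is not
```

Returning the reasons alongside the verdict gives the BLOCK branch something to capture as a failure note.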
Section 06 · the promotion gate

The PR blocks, or it merges.

CI pipeline · agents promoted only after green eval
[Figure: CI pipeline]
01 · PR: prompt change v7 → v8
02 · SMOKE: 30 cases · 20s
03 · FULL SUITE: 200 cases · 6min
04 · REGRESSION: vs v7 baseline · no sub-suite regressed
05 · GATE: MERGE OK · avg 0.87 · ship. If any sub-suite regresses: PR BLOCKED
Every PR runs the gate. Regression on any sub-suite blocks merge. The eval owner, not the engineer, reviews the failure.
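
The regression step can be sketched as a per-sub-suite comparison against the baseline. The sub-suite names, scores, and the 0.01 tolerance are hypothetical; the only rule taken from the page is that any regressed sub-suite blocks the merge.

```python
def regressed_suites(baseline, candidate, tolerance=0.01):
    """Sub-suites whose candidate score fell more than `tolerance` below baseline.

    A sub-suite missing from the candidate run counts as regressed.
    """
    return sorted(
        name
        for name, base_score in baseline.items()
        if candidate.get(name, 0.0) < base_score - tolerance
    )


def ci_gate(baseline, candidate, tolerance=0.01):
    """Return ("MERGE OK" | "PR BLOCKED", regressed sub-suites)."""
    regressed = regressed_suites(baseline, candidate, tolerance)
    return ("PR BLOCKED" if regressed else "MERGE OK"), regressed


# Hypothetical sub-suite scores for the v7 baseline and the v8 candidate.
v7 = {"renewals": 0.90, "refunds": 0.85}
v8 = {"renewals": 0.97, "refunds": 0.80}

verdict, regressed = ci_gate(v7, v8)
print(verdict, regressed)
```

In this example the candidate's average goes up while `refunds` goes down, so an average-only gate would have merged it.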
Section 07 · the hill climb

The eval set is what you optimize.

harness engineering is its own discipline · the model stays frozen
[Figure: PASS RATE × ITERATION, 0% to 100% over v1 to v6, with train and holdout curves; start low (else no hill)]
GRADER LOOP: GENERATOR (artifact) → GRADER (rubric · why) → regenerate (below threshold) or SHIP
starting low gives a hill to climb · holdout guards against overfit · grader = a second agent
Each iteration spends harness tokens, never weights. If the curve flattens, the eval set is wrong - not the model.
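
The grader loop in the figure can be sketched as follows. `generate` and `grade` are stand-ins for the generator agent and the independent grader; their signatures, the iteration cap, and the stubs are assumptions made for illustration.

```python
def grader_loop(generate, grade, threshold, max_iters=5):
    """Regenerate until the grader's score meets the threshold.

    generate(feedback) -> artifact        feedback is None on the first pass
    grade(artifact)    -> (score, why)    `why` is fed back to the generator
    Returns (artifact or None, score history).
    """
    feedback = None
    history = []
    for _ in range(max_iters):
        artifact = generate(feedback)
        score, why = grade(artifact)
        history.append(score)
        if score >= threshold:
            return artifact, history  # SHIP
        feedback = why  # below threshold: regenerate with the grader's reasons
    return None, history  # budget spent without clearing the threshold


# Stubs: each revision scores a little higher than the last.
_drafts = iter(["v1", "v2", "v3"])
_scores = {"v1": 2.0, "v2": 3.0, "v3": 4.0}


def stub_generate(feedback):
    return next(_drafts)


def stub_grade(artifact):
    return _scores[artifact], "needs more specifics"


shipped, history = grader_loop(stub_generate, stub_grade, threshold=3.5)
```

The history list is the pass-rate curve for one artifact; tracking the same curve on a holdout set is what guards against tuning the harness to the train cases.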
Section 08 · drift in production

What's changed since launch?

refusal-rate · the canary signal of drift
[Figure: REFUSAL RATE (0% to 15%) by day, D1 to D35. baseline ~3% · alert > 8% · refusals 3× baseline - investigate]
Refusal-rate and output-length shift before semantics do. Watch them first.
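
A minimal sketch of the canary, using the alert line from the figure (8%). The refusal detector is deliberately crude and hypothetical, and the daily rates are made-up values for illustration.

```python
REFUSAL_PREFIXES = ("i can't", "i cannot", "i'm unable")


def looks_like_refusal(text):
    """Crude, hypothetical detector; a real one would be tuned to the agent."""
    return text.strip().lower().startswith(REFUSAL_PREFIXES)


def refusal_rate(outputs, is_refusal=looks_like_refusal):
    """Fraction of outputs flagged as refusals (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    return sum(1 for text in outputs if is_refusal(text)) / len(outputs)


def days_over_alert(daily_rates, alert_above=0.08):
    """Days whose refusal rate exceeds the alert line, in input order."""
    return [day for day, rate in daily_rates.items() if rate > alert_above]


batch = ["I cannot help with that.", "Renewed for 1 year.", "Sure.", "Done."]
rate = refusal_rate(batch)

# Hypothetical daily rates, not read off the chart.
alerts = days_over_alert({"D1": 0.03, "D7": 0.03, "D14": 0.05, "D21": 0.09})
```

The same shape works for output length: compute a cheap per-day statistic and compare it to a fixed line before reaching for semantic checks.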
Section 09 · production back to golden set

Tail-based sampling, edit-mining, growth.

drop the boring middle, keep what teaches
[Figure: sampling funnel]
01 · PROD TRAFFIC: 10,000/day
02 · TAIL FILTER: keep user edited · tool retried · judge score < .7; drop the rest
03 · CANDIDATES: ~120/day · 1.2% sampled
04 · HUMAN REVIEW (eval owner): promote 18 · label 42 · reject 60
05 · GOLDEN +: +18 cases/day via PR
Sample where it teaches. The set grows from real failures, not synthetic ones.
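
The tail filter's three keep conditions translate directly into a predicate. The trace fields and function names are assumptions about how a trace might be recorded; the conditions and the 0.7 floor are the ones in the figure.

```python
def keep_for_review(trace, judge_floor=0.7):
    """Keep a trace if the user edited it, a tool retried, or the judge scored it low."""
    return (
        trace["user_edited"]
        or trace["tool_retried"]
        or trace["judge_score"] < judge_floor
    )


def tail_filter(traces, judge_floor=0.7):
    """Drop the boring middle: return only the traces worth a human look."""
    return [trace for trace in traces if keep_for_review(trace, judge_floor)]


# Hypothetical traces, for illustration only.
traces = [
    {"id": "t1", "user_edited": False, "tool_retried": False, "judge_score": 0.93},
    {"id": "t2", "user_edited": True, "tool_retried": False, "judge_score": 0.91},
    {"id": "t3", "user_edited": False, "tool_retried": True, "judge_score": 0.88},
    {"id": "t4", "user_edited": False, "tool_retried": False, "judge_score": 0.55},
]
candidates = [trace["id"] for trace in tail_filter(traces)]
```

Everything the filter keeps still goes to the eval owner; the filter only decides what is worth their time.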
Section 10 · tools 2026 · OSS-first

Where the OSS picks land.

programmatic ↑↓ UX-heavy  ·  OSS ←→ paid SaaS
[Figure: quadrant chart. Axes: OSS vs paid SaaS · programmatic / CI-first vs UX-first / dashboard. Quadrants: OSS · CI-first (ideal) · paid · CI-first · OSS · UX-first · paid · UX-first]
Tools plotted: Inspect (UK AISI · 200+ evals) · Promptfoo (MIT, under OpenAI) · DeepEval (14+ metrics · pytest) · Phoenix (arize-phoenix-evals 3.1) · LangSmith (Agent Sandboxing) · Braintrust (skip · $80M SaaS) · Patronus AI (Lynx · GLIDER) · harness (vollko · flagship)
Quadrant 1 (OSS + CI-first) is where the eval harness lives. Anything else is renting.
Section 11 · ways to ship bad evals

Five anti-patterns.

VIBES PROMOTE · "feels better, ship"
TOO SMALL (N=10) · noise > signal
SINGLE SCORER (1 scorer) · one number lies
JUDGE BIAS (A > B always, never swap) · position, length
NO DRIFT (no monitor) · silently regresses
Section 12 · vollko OSS · this layer

The primitives.
