A short markdown document the agent loads on demand. It carries the judgement the codebase can't: the traps prior work hit, the levers worth pulling, and the why behind the rule, with a worked example. Frontmatter tells the agent when to load it; the body only enters context when the moment matches. Every new failure your team sees becomes a new line.
Mitchell Hashimoto — My AI Adoption Journey, Feb 2026 (the ratchet method);
Sean Grove — The New Code, AI Engineer World's Fair 2025 (spec as lossless source of intent);
BettaTech — ¿Qué es esto del Harness Engineering?, 2026 (guides vs sensors).
Mitchell Hashimoto — HashiCorp co-founder, mitchellh.com.
Synthesised into the harness engineering vocabulary by BettaTech (Spanish-language YouTube, late April 2026).
---
name: anti-patterns
description: Common ways prior work on this task has gone wrong. Lead with these; most of them are not obvious from the data alone.
when-to-load: Before you settle on a fitting procedure or evaluation slice.
load-cost: ~600 words.
---
The legal cousin: per-segment δ₀ from input channels (this is THE winning move on the right platforms)
This is the single highest-leverage move on this dataset. In the most recent m3 cohort, all three top-tier agents shipped it and none of the three bottom-tier agents did; the gap between tiers was +8 pts yaw / +15 pts CTE, with model form otherwise identical.
The frontmatter names the role of the doc, not its contents — "common ways prior work has gone wrong." The body opens with a worked example whose first paragraph is cohort evidence: not a principle, an outcome with numbers attached.
Grove's framing applies here: the reference is the lossless source of judgement. The code that implements δ₀ correction is a lossy projection of the insight that bottom-tier agents reliably miss it. The reference carries the insight; the code carries the implementation.
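The load-on-demand behaviour described above can be sketched as a tiny parser that splits a reference into its frontmatter fields and its body. This is a minimal illustration, not the harness's actual loader: the `Reference` class, `parse_reference`, and `index_line` are hypothetical names, and the parser assumes single-line `key: value` frontmatter fields only.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    meta: dict  # frontmatter fields, cheap enough to keep in context
    body: str   # markdown body, read only once the reference is selected


def parse_reference(text: str) -> Reference:
    """Split a markdown reference into frontmatter fields and body.

    Assumption: every frontmatter field fits on one `key: value` line.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return Reference(meta={}, body=text.strip())
    # Find the closing `---` delimiter of the frontmatter block.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("frontmatter is missing its closing '---'")
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return Reference(meta=meta, body=body)


def index_line(ref: Reference) -> str:
    """One-line summary an agent could scan before deciding to load the body."""
    return f"{ref.meta.get('name', '?')}: {ref.meta.get('when-to-load', '')}"
```

Only `index_line` output would need to sit in context up front; the body is fetched when the `when-to-load` condition matches the current step.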
---
name: exploration-discipline
description: Protocol for naming ≥5 alternatives (at least 3 different model structures) before committing to one, plus the EXPERIMENTS.md log convention. Prevents silent re-convergence on the same approach prior cohorts piled up on.
when-to-load: At the start of a fresh task, before your first fit. Re-read whenever you're tempted to "just iterate on the current model".
---
Every EXPERIMENTS.md entry MUST carry a `Rung: 0|1|2|3|orthogonal` tag. The pre-flighting-final-model skill enforces at least one `Rung: 1+` or `Rung: orthogonal` entry before the bundle can ship.
A reference doesn't have to teach — it can prescribe. This one is a procedure: name five alternatives, log them, tag the rung, and the harness will refuse to ship if you skipped the climb.
That last sentence is the ratchet in action — a prior cohort failed (every agent piled up on rung-0 refinements), so the harness was modified to prevent that failure from recurring. The reference doc is the human-readable face of the same change. References and skills co-evolve with the failures they exist to prevent.
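A ship gate of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual pre-flighting-final-model skill: the function names are hypothetical, and the log is assumed to contain plain `Rung: <value>` tags somewhere in each entry.

```python
import re

# Matches tags such as "Rung: 0", "Rung: 2" or "Rung: orthogonal".
RUNG_TAG = re.compile(r"Rung:\s*(orthogonal|[0-3])\b")


def rung_tags(log_text: str) -> list[str]:
    """Return every rung value found in the experiment log, in order."""
    return RUNG_TAG.findall(log_text)


def preflight_ok(log_text: str) -> bool:
    """True if at least one entry climbed past rung 0 or went orthogonal."""
    return any(tag == "orthogonal" or int(tag) >= 1 for tag in rung_tags(log_text))
```

A harness would typically read `EXPERIMENTS.md`, call `preflight_ok` on its text, and refuse to build the bundle when it returns `False`.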
---
name: dynamics-formulations
description: V0 documented in full plus sketches of higher-rung formulations (linear dynamic ST with slip angles, nonlinear tyre, multi-body).
when-to-load: When choosing a model structure, or when residual-structure flags `structure_detected`. Living doc. Append your formulation here when you ship one past V0.
---
Minimum viable rung-1 attempt
A ~30-line code scaffold (Euler integration, fix all params from carParams except C_αf, fit per platform). The cost-to-attempt is lower than past cohorts assumed.
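To give a feel for how small such a scaffold can be, here is a sketch under stated assumptions; it is not the scaffold from the reference itself. It assumes "linear dynamic ST" means the standard linear single-track (bicycle) model with small-angle slip angles, uses a plain dict `p` with hypothetical keys (`m`, `iz`, `lf`, `lr`, `c_ar`) as a stand-in for `carParams`, and fits the front cornering stiffness with a one-dimensional grid search on yaw-rate RMSE.

```python
import numpy as np


def simulate_yaw_rate(delta, vx, dt, p, c_af):
    """Explicit-Euler integration of a linear single-track model.

    delta: steering angle [rad], vx: longitudinal speed [m/s], both per sample.
    Returns the simulated yaw-rate trace [rad/s], starting from rest.
    """
    vy, r = 0.0, 0.0
    out = np.empty(len(delta))
    for k in range(len(delta)):
        out[k] = r
        # Small-angle slip angles at the front and rear axle.
        alpha_f = delta[k] - (vy + p["lf"] * r) / vx[k]
        alpha_r = -(vy - p["lr"] * r) / vx[k]
        fyf = c_af * alpha_f          # front lateral force (the fitted stiffness)
        fyr = p["c_ar"] * alpha_r     # rear lateral force (fixed parameter)
        vy_dot = (fyf + fyr) / p["m"] - vx[k] * r
        r_dot = (p["lf"] * fyf - p["lr"] * fyr) / p["iz"]
        vy += dt * vy_dot
        r += dt * r_dot
    return out


def fit_c_af(delta, vx, yaw_meas, dt, p, candidates):
    """Pick the candidate front stiffness with the lowest yaw-rate RMSE."""
    rmse = [np.sqrt(np.mean((simulate_yaw_rate(delta, vx, dt, p, c) - yaw_meas) ** 2))
            for c in candidates]
    best = int(np.argmin(rmse))
    return float(candidates[best]), float(rmse[best])
```

Fitting per platform then amounts to calling `fit_c_af` once per platform with that platform's parameters and segments. A grid search is used here only to keep the sketch dependency-free; any scalar optimiser would do the same job.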
Some references are append-only catalogues — every agent that ships a successful new formulation adds an entry. This is the artifact-level analogue of skill files as the unit of recursive self-improvement.
The reference grows as the team's vocabulary for the problem grows. Any markdown artifact in the harness can learn this way — AGENTS.md, individual skills, and references all participate in the same ratchet at different grains.
---
name: two-kpi-tradeoff
description: How yaw-rate RMSE and CTE RMSE relate. Two-step diagnostic for "yaw improved but CTE stuck".
when-to-load: After you have a working model and want to interpret your numbers.
---
Failure-mode index
☐ Yaw RMSE improved >30% but CTE barely moved → check per-platform signed bias; a symmetric error distribution survives RMSE improvements but ships as drift.
☐ Pooled score improved, per-platform got worse on one → you fit pooled but evaluated pooled; check the per-platform table.
☐ Dev RMSE matches train RMSE exactly → you split at the sample level inside a segment (route leakage). Re-split.
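The first two checkboxes both come down to looking at residuals per platform rather than pooled. A minimal sketch of such a table follows; the function names and the `ratio=0.5` flagging threshold are illustrative assumptions, not values from the reference.

```python
import numpy as np


def per_platform_table(platform, error):
    """Return {platform: (rmse, signed_bias, n)} for one residual array."""
    platform = np.asarray(platform)
    error = np.asarray(error, dtype=float)
    table = {}
    for name in np.unique(platform):
        e = error[platform == name]
        rmse = float(np.sqrt(np.mean(e ** 2)))
        bias = float(np.mean(e))  # the sign survives here, unlike in RMSE
        table[str(name)] = (rmse, bias, int(e.size))
    return table


def biased_platforms(table, ratio=0.5):
    """Platforms whose |signed bias| exceeds `ratio` of their own RMSE."""
    return [name for name, (rmse, bias, _) in table.items()
            if rmse > 0 and abs(bias) > ratio * rmse]
```

RMSE discards the sign of the error, so a platform whose residuals are all on one side can look fine in the pooled score while still accumulating as drift; the signed-bias column makes that visible.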
Every reference closes with a failure-mode index — a checklist of "you'll see this if…" patterns. This is the Husain pattern from production trace analysis, applied at authoring time: the moment a failure has surfaced often enough to characterise, it earns a checkbox here.
The index is what makes the reference useful at the moment of decision, not just at the moment of reading. The agent runs through it after every fit; the user does too.
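For the route-leakage checkbox in the index, the re-split has to happen at the segment level so that no segment contributes samples to both sides. One way to do that is sketched below; `split_by_segment` and its parameters are hypothetical, and the sketch assumes each sample carries a segment identifier.

```python
import numpy as np


def split_by_segment(segment_ids, dev_fraction=0.2, seed=0):
    """Return boolean (train_mask, dev_mask) with whole segments on one side."""
    segment_ids = np.asarray(segment_ids)
    segments = np.unique(segment_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(segments)
    # Hold out at least one whole segment for the dev set.
    n_dev = max(1, int(round(dev_fraction * len(segments))))
    dev_mask = np.isin(segment_ids, shuffled[:n_dev])
    return ~dev_mask, dev_mask
```

With this split, a dev RMSE that still matches train RMSE exactly would point at some other leak, since neighbouring samples from the same segment can no longer straddle the boundary.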