Level 02

Skills.


Definition · 02

Skill

noun /skɪl/

A folder the agent can load on demand. Inside: metadata that says when to use it, instructions that say how, scripts that do the work, and any assets the work needs. The agent reads the metadata at startup; the rest only enters context when the skill is judged relevant.

01 metadata · 02 instructions · 03 scripts · 04 assets

Anthropic Engineering — Equipping Agents for the Real World with Agent Skills, Dec 2025.
See also Zhang & Murag — Don't Build Agents, Build Skills Instead, AI Engineer Code Summit, late 2025.
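
A minimal sketch of the load-on-demand behaviour described in the definition, assuming each skill is a folder holding a SKILL.md whose metadata sits between two `---` lines. The names `SkillStub`, `scan_skills` and `load_body` are hypothetical and only illustrate the idea; they are not an Anthropic API.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass
class SkillStub:
    """What the agent holds from startup: metadata only, plus a pointer to the rest."""
    name: str
    metadata: str  # raw frontmatter text
    path: Path     # location of SKILL.md, read again only on demand


def split_skill_file(text: str) -> tuple[str, str]:
    """Split a SKILL.md into (frontmatter, body) on its two '---' fence lines."""
    _, frontmatter, body = text.split("---\n", 2)
    return frontmatter.strip(), body.strip()


def scan_skills(root: Path) -> list[SkillStub]:
    """Startup pass: visit every skill folder but keep only its metadata."""
    stubs = []
    for skill_md in sorted(root.glob("*/SKILL.md")):
        frontmatter, _ = split_skill_file(skill_md.read_text(encoding="utf-8"))
        stubs.append(SkillStub(skill_md.parent.name, frontmatter, skill_md))
    return stubs


def load_body(stub: SkillStub) -> str:
    """On-demand pass: read the instructions once the skill is judged relevant."""
    _, body = split_skill_file(stub.path.read_text(encoding="utf-8"))
    return body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skill_dir = Path(tmp) / "residual-structure"
        skill_dir.mkdir()
        (skill_dir / "SKILL.md").write_text(
            "---\nname: residual-structure\n---\nStep-by-step instructions go here.\n",
            encoding="utf-8",
        )
        stubs = scan_skills(Path(tmp))
        print(stubs[0].metadata)    # available from startup
        print(load_body(stubs[0]))  # read only when the skill activates
```

How relevance is judged is left out on purpose: in the definition above that decision belongs to the agent, not to the loader.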

Diagram · 03 · The skills shelf, built up

References · Further reading

Talk · AI Engineer Code Summit · late 2025

Don't Build Agents, Build Skills Instead

Barry Zhang & Mahesh Murag — Anthropic

The same Anthropic speaker who gave the canonical Building Effective Agents talk six months earlier publicly course-corrects: one universal agent plus a library of skills beats one bespoke agent per domain.


Talk · AI Engineer World's Fair 2025

The New Code

Sean Grove — then OpenAI alignment

"Code is a lossy projection of intent; the specification is the lossless source." The same idea Anthropic ships as Skills, arriving from the other major frontier lab.
OpenAI's Model Spec and Anthropic's Skills are the same primitive — versioned, clause-addressable markdown authored by domain experts that compiles to documentation, evaluations, prompts, and behaviour.


Anatomy of a SKILL.md

05 · residual-structure / SKILL.md

module-2.v3/agent-01/skills/residual-structure/SKILL.md YAML · frontmatter

---
name: residual-structure
description: After a fit, characterise what's LEFT in the residual —
  temporal autocorrelation at multiple lags, Pearson correlation with
  each input feature AND its first time-derivative, sign-asymmetry in δ.
  Returns a per-platform verdict — either "noise_floor" (stop;
  you're done) or "structure_detected" with a specific reason
  ("residual autocorrelated at lag 6 → try a τ·d(δ)/dt term"). Use as
  the bridge between fit-model and "is V2 worth building?". This is
  the diagnostic the v2 cohort silently lacked — almost everyone
  shipped V1 understeer; the one agent who didn't (m2-agent-05,
  +51.5% yaw) saw exactly this autocorrelation signature and added
  a steering-rate lead.
when-to-invoke: After running `fit-model` and `score-model`, when
  you are trying to decide whether your current model has more headroom
  or you are at the noise floor. Especially when yaw RMSE has stalled
  and you do not know whether to ship or keep iterating.
when-NOT-to-invoke: Before any fit (run score-model first; you
  need a fitted predict_fn). To see route-level bias (use route-bias).
  To plot residual vs one feature (use inspect-residuals).
load-cost: ~210 tokens metadata, ~500 tokens body.
---
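
A rough sketch of how a harness might read this frontmatter, assuming only the flat layout shown above: top-level `key: value` pairs with indented continuation lines. It is deliberately not a general YAML parser, and `parse_frontmatter` is a hypothetical helper. The sample text is an abbreviated copy of the fields above.

```python
import re

# A top-level key starts in column 0; continuation lines are indented.
KEY_LINE = re.compile(r"^([A-Za-z][\w-]*):\s*(.*)$")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat 'key: value' frontmatter, folding indented continuation lines."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != "---":
        raise ValueError("missing opening '---'")
    fields: dict[str, str] = {}
    current = None
    for line in lines[1:]:
        if line == "---":  # closing fence: the body starts after this
            break
        match = KEY_LINE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None and line.strip():
            fields[current] += " " + line.strip()
    return fields


SAMPLE = """\
---
name: residual-structure
description: After a fit, characterise what's LEFT in the residual.
  Returns a per-platform verdict.
load-cost: ~210 tokens metadata, ~500 tokens body.
---
Body instructions would follow here.
"""

fields = parse_frontmatter(SAMPLE)
# Pull the two approximate prices out of the load-cost receipt.
costs = {kind: int(n) for n, kind in re.findall(r"~(\d+) tokens (\w+)", fields["load-cost"])}
print(fields["name"], costs)
```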


A · description

Carries judgement, not mechanics.

Names the verdict the skill produces — and the failure mode it prevents — not the function it calls. Where your organisation's expertise lives.
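
One way the verdict named in the description could be computed, as a sketch only. It covers just the autocorrelation check; the correlation with input features and the sign-asymmetry test are omitted. The lag set and the 2/sqrt(n) white-noise band are assumptions, not values taken from the actual skill.

```python
import math


def autocorrelation(residual, lag):
    """Sample autocorrelation of the residual at a positive lag."""
    n = len(residual)
    mean = sum(residual) / n
    centred = [r - mean for r in residual]
    variance = sum(c * c for c in centred)
    if variance == 0.0:
        return 0.0  # a constant residual carries no temporal structure
    covariance = sum(centred[i] * centred[i + lag] for i in range(n - lag))
    return covariance / variance


def residual_verdict(residual, lags=(1, 3, 6, 12)):
    """Return 'noise_floor' or 'structure_detected' together with a reason."""
    threshold = 2.0 / math.sqrt(len(residual))  # rough white-noise band (assumption)
    worst_lag = max(lags, key=lambda k: abs(autocorrelation(residual, k)))
    rho = autocorrelation(residual, worst_lag)
    if abs(rho) <= threshold:
        return {"verdict": "noise_floor", "reason": None}
    return {
        "verdict": "structure_detected",
        "reason": f"residual autocorrelated at lag {worst_lag} (rho={rho:+.2f})",
    }


# A slow oscillation left in the residual is structure, not noise.
wave = [math.sin(2 * math.pi * i / 24) for i in range(240)]
print(residual_verdict(wave)["verdict"])
```

The verdict string and its reason are what the description promises the agent. The lag list and threshold are the mechanics that stay inside the skill.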


B · when-NOT-to-invoke

Routes the agent.

Three explicit redirects. Without them the agent guesses which skill to load; with them, it routes.


C · load-cost

Prices the abstraction.

Metadata is paid every turn; body only when this skill activates. Without the receipt, you can't budget the eleventh skill.
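
A back-of-envelope sketch of that budget. It assumes, purely for illustration, that every skill costs about what residual-structure declares (~210 tokens of metadata, ~500 of body) and that a session runs 40 turns. Both are assumptions, and `turn_cost` is a hypothetical helper.

```python
METADATA_TOKENS = 210  # approximate, from the load-cost field
BODY_TOKENS = 500      # approximate, from the load-cost field


def turn_cost(n_skills, n_active, metadata_tokens=METADATA_TOKENS, body_tokens=BODY_TOKENS):
    """Tokens spent on skills in one turn: all metadata, plus bodies of active skills."""
    return n_skills * metadata_tokens + n_active * body_tokens


idle_ten = turn_cost(10, 0)    # ten skills installed, none active
one_active = turn_cost(10, 1)  # the same shelf with one body loaded
eleventh = turn_cost(11, 0) - turn_cost(10, 0)  # standing cost of one more skill
session_turns = 40             # hypothetical session length

print(idle_ten, one_active, eleventh, eleventh * session_turns)
```

Under these assumptions the eleventh skill costs 210 tokens on every turn whether or not it is ever used, which is the number the load-cost field makes visible.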

The toolkit — 10 skills

06 · env-template-m2 / skills/

Oracle · inner loop

score-model KPIs + diagnostics for any predict().

fit-model Tune per-platform coefficients with scipy.

Diagnose · what's left

residual-structure Verdict — noise floor or more to model?

route-bias Rank routes by share of pooled error.

inspect-residuals Plot residual vs 1 or 2 input features.

compare-models Diff two predict()s segment-by-segment.

visualise-segment 3-panel PNG: trajectory · yaw · residual.

Prep · data plumbing

load-segments sim.csv → DataFrames with provenance.

make-train-dev-split Route-grouped split — no leakage.

Ship · the gate

pre-flight-final-model Verify the final-model/ bundle contract.
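
For the data-plumbing entries, a sketch of what a route-grouped split with no leakage could look like: whole routes go to one side or the other, so no route contributes segments to both train and dev. The `route` key, the 20% dev fraction and the function name are assumptions, not the skill's actual interface.

```python
import random
from collections import defaultdict


def route_grouped_split(segments, dev_fraction=0.2, seed=0):
    """Assign whole routes to train or dev so that no route appears in both."""
    by_route = defaultdict(list)
    for segment in segments:
        by_route[segment["route"]].append(segment)

    routes = sorted(by_route)              # fixed order before the seeded shuffle
    random.Random(seed).shuffle(routes)
    n_dev = max(1, round(len(routes) * dev_fraction))  # always hold out at least one route

    dev = [s for route in routes[:n_dev] for s in by_route[route]]
    train = [s for route in routes[n_dev:] for s in by_route[route]]
    return train, dev


# Five hypothetical routes with three segments each.
segments = [{"route": f"route-{r}", "segment": s} for r in range(5) for s in range(3)]
train, dev = route_grouped_split(segments)
print(len(train), len(dev))
```

Splitting by segment instead would put neighbouring segments of one route on both sides, which is the leakage the skill's summary rules out.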

ORACLE · INNER LOOP

score-model

The inner-loop oracle.

Runs any predict() over the segments and returns one structured bundle — pooled yaw + CTE RMSE, per-segment / per-platform / per-route / per-regime tables, residual stats, worst-N outliers, and a signed-bias warning block at the top. Schema-aware: resolves the ground-truth column per platform, so Tesla (and any non-default schema) scores instead of being silently skipped.

when to invoke · Every iteration. This is the oracle: read its bias-warnings block before deciding anything else.
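
A much-reduced sketch of such a bundle, assuming a single scalar target per segment rather than separate yaw and CTE. The keys `platform`, `features` and `truth`, the bias threshold, and the report layout are all hypothetical. The real skill also returns per-segment, per-route and per-regime tables, which are left out here.

```python
import math
from collections import defaultdict


def rmse(errors):
    """Root-mean-square of a non-empty list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def score_model(predict, segments, bias_threshold=0.05):
    """Score any predict() and return one bundle with bias warnings listed first."""
    errors_by_platform = defaultdict(list)
    for segment in segments:
        error = predict(segment["features"]) - segment["truth"]
        errors_by_platform[segment["platform"]].append(error)

    bias_warnings = []
    for platform, errors in sorted(errors_by_platform.items()):
        mean_error = sum(errors) / len(errors)  # signed, so over/under is visible
        if abs(mean_error) > bias_threshold:
            direction = "over" if mean_error > 0 else "under"
            bias_warnings.append(
                f"{platform}: {direction}-predicts by {abs(mean_error):.3f} on average"
            )

    pooled = [e for errors in errors_by_platform.values() for e in errors]
    return {
        "bias_warnings": bias_warnings,
        "pooled_rmse": rmse(pooled),
        "per_platform_rmse": {p: rmse(e) for p, e in errors_by_platform.items()},
    }


demo_segments = [
    {"platform": "A", "features": {"x": 2.0}, "truth": 1.0},
    {"platform": "A", "features": {"x": 2.0}, "truth": 1.0},
    {"platform": "B", "features": {"x": 1.0}, "truth": 1.5},
    {"platform": "B", "features": {"x": 2.0}, "truth": 1.5},
]
report = score_model(lambda features: features["x"], demo_segments)
print(report["bias_warnings"])
```

Platform B shows why the signed check matters: its errors cancel to zero bias even though its RMSE is not zero, so only platform A raises a warning.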

The orientation file — AGENTS.md

07 · env-template-m2 / AGENTS.md

env-template-m2/AGENTS.md markdown · project root

## Working directory layout

Maps the workspace — skills/, _shared/, data/, code/, final-model/. So the agent knows where things live before it touches anything.

## Skills inventory

One paragraph per skill — what it returns, what to read first, the cohort lesson baked in. The agent learns the toolkit without opening ten SKILL.md files.

## Suggested loop

Score → fit → diagnose → iterate the model, not the fit. Encodes how skills chain — the workflow the cohort discovered, frozen as instructions.

## Don't ship V1

The specific failure mode that capped the v2 cohort. "After your first fit, run residual-structure and build a second candidate." Hard-won lesson, written down once.

## Working with skills

"Skills are clay, not library." If a skill is in the way — delete it. The only obligation is to lower the canonical KPIs.

AGENTS.md is the project's orientation file — the agent reads it once at startup, before any skill metadata is even scanned. It's the manifest that turns a folder into a workspace.

01

Always in context.

In context on every turn, unlike skill bodies, so it has to be tight. Anything you put here is paid for on every turn; anything you leave for a SKILL.md body only enters when needed.

02

Names the toolkit.

The agent doesn't have to ls skills/ and guess. AGENTS.md lists every skill with its judgement — same writing discipline as a SKILL.md description, just one level up.

03

Encodes the loop.

Tells the agent how the skills chain — score → fit → diagnose → iterate. Without this, the agent picks an arbitrary order and burns turns rediscovering the workflow.

04

Carries the hard-won lessons.

"Don't ship V1" is not advice — it's a recorded failure mode. AGENTS.md is where the cohort's institutional memory lives, so the next agent doesn't repeat the same ceiling.

05

Sets the contract.

Skills are clay, not library. The agent has permission to modify, extend, or delete any skill. The only obligation is the KPI. AGENTS.md is where that permission is granted.
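
One reading of the loop in point 03, sketched with stand-in callables. `fit`, `score` and `diagnose` are placeholders for the fit-model, score-model and residual-structure skills, and the candidate names and KPI values are made up for the example. Each candidate is fitted, scored and diagnosed, and a "structure_detected" verdict moves on to the next model structure instead of re-tuning the same one.

```python
def iterate_models(candidates, fit, score, diagnose):
    """Try model structures in order until the residual reaches the noise floor."""
    history = []
    for candidate in candidates:
        fitted = fit(candidate)
        kpi = score(fitted)         # lower is better
        verdict = diagnose(fitted)  # "noise_floor" or "structure_detected"
        history.append({"model": fitted, "kpi": kpi, "verdict": verdict})
        if verdict == "noise_floor":
            break                   # nothing left to model: stop iterating
    best = min(history, key=lambda entry: entry["kpi"])
    return best, history


# Stand-ins with made-up numbers: V1 leaves structure behind, V2 does not.
fake_kpi = {"v1-fitted": 0.80, "v2-fitted": 0.45, "v3-fitted": 0.44}
fake_verdict = {"v1-fitted": "structure_detected", "v2-fitted": "noise_floor"}

best, history = iterate_models(
    ["v1", "v2", "v3"],
    fit=lambda name: name + "-fitted",
    score=lambda fitted: fake_kpi[fitted],
    diagnose=lambda fitted: fake_verdict[fitted],
)
print(best["model"], len(history))
```

In this toy run the loop never stops at V1, and it stops at V2 without building V3, which is the behaviour "Don't ship V1" asks for.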