Watch the cohort fall toward the origin as each level adds context. V0 first — then Level 1, then Level 2, then Level 3. Level 4 is the polish; the shape was already set by then.
V0 is the do-nothing baseline. It marks where every agent starts: top-right of the chart, with high error on both yaw and CTE, far from where we want to be.
Chart legend: V0 baseline
V0 yaw RMSE: 0.0163 rad/s
V0 CTE RMSE: 254.3 m
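Both baseline figures are root-mean-square errors. The page does not show how they were computed, so the following is only a minimal sketch of a generic RMSE, assuming equal-length sequences of predicted and reference values. The function name and the sample values are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must be non-empty")
    # Square each pointwise error, average, then take the square root.
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical yaw samples in rad/s, for illustration only (not from the study).
example_rmse = rmse([0.10, 0.12, 0.08], [0.11, 0.10, 0.09])
```

Because errors are squared before averaging, a few large misses weigh more than many small ones, which is why a baseline that never corrects itself sits so far from the origin.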
Stage 1
Level 1 agents enter the field.
Ten first-pass models. The pack falls a long way in one step: most of the headroom is in simply having any physically grounded prediction.
Chart legend: V0 baseline, Level 1 agents
Cohort size so far: 10 agents
Best yaw RMSE: 0.0080 rad/s (↓ 51.0% vs V0)
Best CTE RMSE: 108.8 m (↓ 57.2% vs V0)
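The "vs V0" figures read as relative reductions against the baseline. Here is a minimal sketch of that arithmetic, assuming the drop is computed as (baseline − value) / baseline. The function name is hypothetical; the two CTE values are the ones reported above.

```python
def improvement_vs_baseline(baseline, value):
    """Percentage reduction of `value` relative to `baseline` (positive means better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline * 100.0


V0_CTE_RMSE = 254.3      # metres, V0 baseline reported in the text
LEVEL1_BEST_CTE = 108.8  # metres, best Level 1 agent reported in the text

level1_cte_drop = improvement_vs_baseline(V0_CTE_RMSE, LEVEL1_BEST_CTE)
print(f"Level 1 best CTE: {level1_cte_drop:.1f}% below V0")
```

Recomputing from the displayed, rounded RMSEs can differ from a published percentage in the last digit if the original was derived from unrounded values.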
Stage 2
Level 2 — same brief, sharper skill.
Refit the same shape with better fitting choices. The pack tightens; the floor doesn’t move much.
Chart legend: V0 baseline, Level 1 agents, Level 2 agents
Cohort size so far: 20 agents
Best yaw RMSE: 0.0080 rad/s (↓ 51.0% vs V0)
Best CTE RMSE: 103.3 m (↓ 59.4% vs V0)
Stage 3
Level 3 — domain knowledge in the prompt.
Vehicle dynamics handed to the agent. The whole cohort shifts down on CTE — the kind of move you don’t get from compute.
Chart legend: V0 baseline, Level 1 agents, Level 2 agents, Level 3 agents
Cohort size so far: 30 agents
Best yaw RMSE: 0.0070 rad/s (↓ 56.8% vs V0)
Best CTE RMSE: 70.4 m (↓ 72.3% vs V0)
Trajectory
What each level moved.
Improvement vs V0, level by level. Median is the cohort centre; best is the strongest single agent. Level 4 is not shown: it is the polish pass, and the numbers are the same.
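Median and best can be read off a cohort in a few lines. This is a minimal sketch, assuming lower RMSE is better and that improvement is the relative drop from V0. The function name is hypothetical, and the cohort values are invented for illustration: only the 254.3 m baseline and the 108.8 m best come from the text.

```python
from statistics import median


def summarise_cohort(rmse_values, baseline):
    """Return (median, best) improvement vs the baseline, in percent."""
    if not rmse_values:
        raise ValueError("cohort is empty")

    def drop(value):
        return (baseline - value) / baseline * 100.0

    # Lower RMSE is better, so the strongest agent is the minimum.
    return drop(median(rmse_values)), drop(min(rmse_values))


# Hypothetical CTE RMSEs in metres; NOT the study's data.
hypothetical_cohort = [108.8, 120.0, 131.5, 140.2, 155.0]
median_drop, best_drop = summarise_cohort(hypothetical_cohort, 254.3)
print(f"median {median_drop:.1f}%, best {best_drop:.1f}%")
```

The gap between the two numbers is a rough measure of how tight the pack is: when a level sharpens skill without moving the floor, the median closes in on the best.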