Block 2 / Topic 2.3 — Description optimizer + recap

2.3.1

Launch the optimizer

5 min Hands-on

Slide 1 / 2 · What it does

Auto-iterate the `description` field

Body stays fixed; only description changes
Scored against trigger-eval.json — 20 queries
10 should-trigger, 10 should-not-trigger (near-misses on both sides)
5 iterations, 60/40 train/test split
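
The 60/40 split can be sketched as below. This is a minimal illustration, not the plugin's own code: the query records, the `split_queries` helper, and the fixed seed are hypothetical, and it assumes a plain random split (the slides do not say whether the split is stratified by label).

```python
import random

def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle a copy of the queries and cut it into train and test parts."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for the 20 entries in trigger-eval.json.
queries = (
    [{"query": f"positive-{i}", "should_trigger": True} for i in range(10)]
    + [{"query": f"negative-{i}", "should_trigger": False} for i in range(10)]
)

train, test = split_queries(queries)
print(len(train), len(test))  # 12 8
```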

Slide 2 / 2 · Launch

Launch it now

python -m scripts.run_loop \
  --eval-set evals/trigger-eval.json \
  --skill-path skills/project-inspector \
  --model claude-sonnet-4-6 \
  --max-iterations 5 \
  --verbose

Runs ~8–10 min in the background. We'll narrate while it runs.

2.3.2

Narration while it runs

5 min Lecture

Slide 1 / 3 · The loop

What the optimizer is doing right now

current description
    ↓ evaluate on 12 train + 8 test queries (3 runs each)
trigger rates + failures
    ↓ ask Claude to propose improved description (failures as context)
new description
    ↓ evaluate on same 12 + 8 (same 3-run protocol)
    ↓ repeat 5 times
pick description with best TEST score
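
The loop above can be sketched in a few lines. This is an assumption-laden outline, not the implementation behind `scripts.run_loop`: `optimize_description`, `fake_evaluate`, and `fake_propose` are hypothetical names, and the stub scores are invented so the sketch runs without a model or an eval set.

```python
from typing import Callable

def optimize_description(description: str,
                         evaluate: Callable[[str], dict],
                         propose: Callable[[str, list], str],
                         max_iterations: int = 5) -> dict:
    """Evaluate, propose, repeat; return the candidate with the best test score."""
    candidates = []
    for iteration in range(1, max_iterations + 1):
        result = evaluate(description)
        candidates.append({"iteration": iteration,
                           "description": description,
                           **result})
        if iteration < max_iterations:
            # Failures from this round become context for the next proposal.
            description = propose(description, result["failures"])
    # max() returns the first maximal item, so ties go to the earliest candidate.
    return max(candidates, key=lambda c: c["test"])


# Hypothetical stubs: invented scores, no model call.
fake_test_scores = iter([0.500, 0.625, 0.750, 0.625, 0.750])

def fake_evaluate(description: str) -> dict:
    return {"train": 0.5, "test": next(fake_test_scores),
            "failures": ["hypothetical missed query"]}

def fake_propose(description: str, failures: list) -> str:
    return description + " (revised)"

best = optimize_description("original description", fake_evaluate, fake_propose)
print(best["iteration"], best["test"])  # 3 0.75
```

Iteration 1 evaluates the original description, so five iterations mean four proposals.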

Slide 2 / 3 · Overfit

Why select on test, not train

Train score = how well you did on queries you've already seen
Test score = how well you'll do on queries you haven't
Select on train → great on your 12 queries, bad on everything else
Select on test → generalizes
Same idea as ML train/test splits. Smaller scale.
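
A tiny illustration of why the selection metric matters. The two candidates and their scores are invented for this example and are not from the workshop run.

```python
# Hypothetical scores, invented for illustration only.
candidates = [
    {"name": "A", "train": 12 / 12, "test": 4 / 8},  # memorized the train queries
    {"name": "B", "train": 10 / 12, "test": 7 / 8},  # slightly worse on train, better held-out
]

picked_on_train = max(candidates, key=lambda c: c["train"])
picked_on_test = max(candidates, key=lambda c: c["test"])

print(picked_on_train["name"], picked_on_test["name"])  # A B
```

Selecting on train keeps candidate A, which looks perfect on the queries already seen; selecting on test keeps B, which does better on the held-out ones.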

Slide 3 / 3 · Production stakes

Triggering accuracy is the #1 production failure mode

A skill that never triggers when it should = doesn't exist
A skill that triggers on the wrong things = worse (pollutes other conversations)
Description-engineering is its own discipline
The plugin does it for you in 10 minutes

Speaker notes

Slide 1 — describe the loop: "Iteration 1: evaluate the current description, get a trigger rate per query, log every miss. Call Claude with the failures as context, ask it to propose an improved description. Iteration 2: evaluate the new description. Repeat five times. At the end: pick the description with the best test score — not train."

Slide 2 — the overfit teaching moment: "Why test, not train? Overfit. The third pathology we set up at the start. If you select on train score, you'll pick a description that's great for the 12 queries you wrote and bad for everything else. The held-out test is the discipline that prevents that. Same idea as machine-learning train/test splits."

Slide 3 — production stakes: "Triggering accuracy is the number-one production failure mode for skills. A skill that never triggers when it should is a skill that doesn't exist. A skill that triggers on the wrong things is worse — it pollutes other conversations. Description-engineering is its own discipline. The plugin does it for you in ten minutes."

Then: "Any questions while the optimizer runs?" Take 1–2 from the room.

2.3.3

Read the plateau

6 min Hands-on

Reframed beat (Path B, post-dry-run 2026-05-18). The optimizer plateaued — 5 iterations all tied at 6/12 train · 4/8 test · 0/10 positive triggers. Story shifted from "see the train-test gap" to "see what the optimizer can't reach."

Slide 1 / 5 · Reading the report

Five rows, one per iteration

Proposed description (full text)
Train score (12 queries × 3 runs = 36 trigger decisions)
Test score (8 queries × 3 runs = 24 trigger decisions)
The gap between train and test — if there is one

Slide 2 / 5 · The plateau

Five iterations. Same score.

6/12 train · 4/8 test

Every iteration. No gap. No winner. Optimizer kept iter 1 (ties go to original).
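
One way a "ties go to the original" rule can fall out of the selection step, using the scores from this report. How the optimizer actually breaks ties is an assumption here; the sketch only relies on Python's `max` returning the first maximal item.

```python
# The five report rows: every iteration scored 6/12 train and 4/8 test.
rows = [{"iteration": i, "train": 6 / 12, "test": 4 / 8} for i in range(1, 6)]

# With all test scores tied, max() keeps the first row it saw.
kept = max(rows, key=lambda r: r["test"])
print(kept["iteration"])  # 1
```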

The story I expected isn't the story this report has.

Slide 3 / 5 · Where the score is hiding

Zero sensitivity. Hundred percent specificity.

0/10 positive queries triggered — across all 5 iterations
10/10 negative queries correctly didn't trigger
The skill is dormant on real-world language
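
The arithmetic behind this slide, worked from the per-query counts it reports. The `rates` helper is a hypothetical name; the formulas are the standard definitions of sensitivity, specificity, and accuracy.

```python
def rates(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    total = true_pos + false_neg + true_neg + false_pos
    accuracy = (true_pos + true_neg) / total
    return sensitivity, specificity, accuracy

# Counts from the slide: 0 of 10 positives triggered, 10 of 10 negatives stayed quiet.
sens, spec, acc = rates(true_pos=0, false_neg=10, true_neg=10, false_pos=0)
print(sens, spec, acc)  # 0.0 1.0 0.5
```

The 0.5 accuracy is the same 10/20 that shows up as 6/12 train plus 4/8 test: a skill that never fires gets every negative right and every positive wrong.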

Slide 4 / 5 · Iter 1 vs Iter 5

A better description didn't move the score

ITER 1 — ORIGINAL

Point at any code repo. Produces a structured briefing — architecture, entry points, data flow, key dependencies, test patterns. Use this whenever the user asks to understand, summarize, onboard to, or get a briefing on a codebase, even if they don't say "inspect."

ITER 5 — REWRITTEN

Invoke for any request where the user wants to understand how an existing codebase works — not change it, not compare frameworks, but map it. Classic triggers: lost in an inherited repo, exploring a client's legacy app, onboarding to an unfamiliar service.

The optimizer wrote a better description. The score didn't move. The description isn't the bottleneck.

Slide 5 / 5 · What this tells us

Optimizers can only optimize what they touch

The description was the lever. It got pulled. The score didn't budge.
Bottleneck likely lives elsewhere. Candidates to investigate: trigger detection, eval harness, model config
That's the lesson the optimizer didn't intend to teach but did anyway

Speaker notes

Slide 1: "Report's up. Five rows, one per iteration. Train and test scores per row. The story I planned to tell here — 'see the train-test gap, that's overfit' — isn't the story this report has. Look at the scores."

Slide 2: "Every iteration scored the same. Six on twelve train. Four on eight test. Five iterations. Flat. No gap to point at. No winning iteration to celebrate — the optimizer kept iteration one because tied scores go to the original."

Slide 3, drill into per-query: "Of ten positive queries, the should-trigger cases, the skill fired on zero. That's zero out of thirty attempts per iteration: ten queries, three runs each. Of ten negative queries, the should-not-trigger cases, the skill correctly stayed quiet on all of them. Hundred percent specificity. Zero sensitivity. The skill is dormant on real-world language."

Slide 4 — read iter 5 description aloud: "Look at iteration five's description. [Read aloud from the right column.] That is visibly better-tuned to queries like 'I just inherited this Nuxt project' than iteration one's description. The optimizer wrote a better description. The score didn't move. The description isn't the bottleneck."

Slide 5 — sit with the lesson: "What the optimizer told us — by giving us a plateau — is that the bottleneck lives somewhere else. Maybe in the trigger detection itself. Maybe in how Claude weighs descriptions against natural language. Maybe the eval harness needs a different config. Honest answer: we'd need to investigate to know for sure. And that investigation is the lesson. Optimizers can only optimize what they're allowed to touch."

2.3.4

CoT for tools · Block 1 callback

3 min Lecture Doctrine close

Slide 1 / 3 · Block 1 callback

CoT — why it works

Don't trust the model to invent the right thinking steps
You tell it: "Think step by step. First X, then Y, then Z."
You write the structure. The model fills it in.

Slide 2 / 3 · Same principle, one altitude up

Tool-CoT — when triggering hits its ceiling

Don't trust the model to invent the right tool either
Tell it: "Use the project-inspector skill on this repo."
Auto-trigger is convenient. Explicit invocation is reliable. In production, you do both.

Slide 3 / 3 · Two patterns. You already use one.

Direct what the AI writes — direct which tool it uses

CoT · Explicit invocation

The plateau is honest data. Description engineering has limits. Knowing where they are before you ship is worth more than another ten percent on a triggering benchmark.

Speaker notes

Slide 1: "Block 1. Chain-of-thought. Why does CoT work? Because you don't trust the model to invent the right thinking steps on its own. You tell it. 'Think step by step. First do X, then Y, then Z.' You write the structure of the reasoning. The model fills it in."

Slide 2: "Now scale that idea up one level. Skill triggering. Why would you trust the model to invent the right tool for the job? You wouldn't. You shouldn't. Same principle. Tell it. 'Use the project-inspector skill on this repo.' 'Invoke the test-runner.' The plateau we just looked at is the data showing you the ceiling on description-only triggering. Auto-trigger is convenient. Explicit invocation is reliable. In production, you do both."

Slide 3: "You already do this with CoT — you write the thinking steps because you don't trust the model to know them. Same move with skills: write the description because you want it to trigger; invoke explicitly when it has to trigger. The optimizer's plateau is honest data: description engineering has limits. Knowing where the limits are before you ship is worth more than another ten percent on a triggering benchmark."

2.3.5

Recap

2 min Lecture

Slide 1 / 2 · The journey

One artifact, seven stages

v0 prompt ("summarize this codebase") — Block 1.2
v5 prompt (AI-as-Coach, 5-component rubric) — Block 1.2
v5 → working SKILL.md — Beat A
Benchmarked vs baseline — Beats C–E
Eval suite itself evaluated; real skill bug + 3 non-discriminating assertions — Beat F
Coached the skill — Beat G
Ran the optimizer · learned its ceiling — Topic 2.3

Slide 2 / 2 · Four lessons

Four lessons in seventy-five minutes

Does the output work?
Do the tests test anything?
Does the description help — and where does it stop?
When triggering hits its ceiling, do you invoke explicitly?

One artifact. Four altitudes. That's the discipline.

Speaker notes

Slide 1: "Trace the artifact. v0 — naked prompt, Block 1.2. v5 — same prompt, enhanced via AI-as-Coach this morning, five components as your rubric. v5 became a skill in Beat A. The skill got benchmarked, Beats C–E. The eval suite itself got evaluated in Beat F — and we caught a real skill bug and three non-discriminating assertions in one pass. We coached the skill in Beat G. And finally we ran the optimizer and learned its ceiling. The data told us auto-triggering has limits."

Slide 2 — the workshop-morning anchor: "One artifact. Four lessons. Does the output work? Do the tests test anything? Does the description help — and where does it stop helping? And when triggering hits its ceiling, do you invoke explicitly? Seventy-five minutes. That's the discipline."

2.3.6

Bridge to Sam

2 min Lecture

Slide 1 / 2 · After lunch

Block 3 — Sam takes the stage

Vibe coding → agentic engineering
How working engineers compose skills like this every day
Superpowers, beads, hooks, multi-CLI verification, token discipline
The discipline you just learned doesn't go away — the workflow around it deepens

Slide 2 / 2 · Lunch

See you at one

"The plugin you used today is the training-wheels version of what Sam does without thinking."

Lunch · 12:00 – 1:00 · See you at 1 PM
