BLOCK 2 · BUILD & EVAL A SKILL · 11:40 – 12:03 · 23 min · 13L / 10HO
TOPIC 2.3 / 3
Optimizer & Ceiling
Auto-iterate the description field and score it on a held-out test set. Read what the optimizer's
data tells us: gap, plateau, or otherwise. Carry the Block 1 chain-of-thought (CoT) callback up one level, to tool selection.
Knowing the ceiling beats chasing another ten percent.
2.3.1
Launch the optimizer
5 min · Hands-on
Slide 1 / 2 · What it does
Auto-iterate the description field
Body stays fixed; only description changes
Scored against trigger-eval.json — 20 queries
10 should-trigger, 10 should-not-trigger (near-misses on both sides)
Runs ~8–10 min in the background. We'll narrate while it runs.
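A minimal sketch of how a 20-query set like this could be split into the 12 train + 8 test queries used in the loop. The field names "query" and "should_trigger" and the placeholder queries are assumptions for illustration, not the plugin's actual schema. The balanced 6+6 / 4+4 split is also an assumption, though it is consistent with the 6/12 and 4/8 scores reported later.

```python
import random

# Hypothetical entry shape; "query" and "should_trigger" are assumed names.
def make_placeholder_set(n_pos=10, n_neg=10):
    pos = [{"query": f"positive-{i}", "should_trigger": True} for i in range(n_pos)]
    neg = [{"query": f"negative-{i}", "should_trigger": False} for i in range(n_neg)]
    return pos + neg

def stratified_split(items, train_per_class=6, seed=0):
    """Split so train and test each keep an even positive/negative mix."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [it for it in items if it["should_trigger"] is label]
        rng.shuffle(group)
        train.extend(group[:train_per_class])
        test.extend(group[train_per_class:])
    return train, test

items = make_placeholder_set()
train, test = stratified_split(items)
print(len(train), len(test))  # 12 8
```

Splitting per class keeps both halves balanced, so a description that never triggers scores exactly 50% on each.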
2.3.2
Narration while it runs
5 min · Lecture
Slide 1 / 3 · The loop
What the optimizer is doing right now
current description
↓ evaluate on 12 train + 8 test queries (3 runs each)
trigger rates + failures
↓ ask Claude to propose improved description (failures as context)
new description
↓ evaluate on same 12 + 8 (same 3-run protocol)
↓ repeat 5 times
pick description with best TEST score
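The loop above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: evaluate and propose are hypothetical callables standing in for the 3-run scoring pass and the call that asks Claude for a rewrite, and the stubs below simply reproduce a flat score.

```python
def optimize_description(original, evaluate, propose, iterations=5):
    """Evaluate `iterations` descriptions; return (best_description, history).

    evaluate(desc) -> (train_rate, test_rate, failures)   # assumed shape
    propose(desc, failures) -> new description            # assumed shape
    """
    history = []
    current = original
    for i in range(iterations):
        train_rate, test_rate, failures = evaluate(current)
        history.append((current, train_rate, test_rate))
        if i < iterations - 1:
            current = propose(current, failures)
    # Select on TEST score; strict ">" means ties go to the earliest entry.
    best = history[0]
    for entry in history[1:]:
        if entry[2] > best[2]:
            best = entry
    return best[0], history

# Stub evaluator: every description scores 6/12 train and 4/8 test.
def plateau_evaluate(description):
    return 6 / 12, 4 / 8, ["placeholder failed query"]

counter = {"n": 1}

# Stub proposer: returns a numbered placeholder instead of calling a model.
def numbered_propose(description, failures):
    counter["n"] += 1
    return f"description v{counter['n']}"

best, history = optimize_description("description v1", plateau_evaluate, numbered_propose)
print(best)          # description v1
print(len(history))  # 5
```

With a flat score, the strict comparison keeps the original description, which is the tie-breaking behavior described in 2.3.3.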
Slide 2 / 3 · Overfit
Why select on test, not train
Train score = how well you did on queries you've already seen
Test score = how well you'll do on queries you haven't
Select on train → great on your 12 queries, bad on everything else
Select on test → generalizes
Same idea as ML train/test splits, at smaller scale. Strictly, a split you select on plays the role of a validation set.
Slide 3 / 3 · Production stakes
Mis-triggering is the #1 production failure mode
A skill that never triggers when it should = doesn't exist
A skill that triggers on the wrong things = worse (pollutes other conversations)
Description-engineering is its own discipline
The plugin does it for you in about 10 minutes
2.3.3
Read the plateau
6 min · Hands-on
Reframed beat (Path B, post-dry-run 2026-05-18). The optimizer plateaued — 5 iterations all tied at 6/12 train · 4/8 test · 0/10 positive triggers. Story shifted from "see the train-test gap" to "see what the optimizer can't reach."
Every iteration. No gap. No winner. Optimizer kept iter 1 (ties go to original).
The story I expected isn't the story this report has.
Slide 3 / 5 · Where the score is hiding
Zero sensitivity. Hundred percent specificity.
0/10 positive queries triggered — across all 5 iterations
10/10 negative queries correctly didn't trigger
The skill is dormant on real-world language
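The two headline numbers follow directly from the counts on this slide. A small sketch of the arithmetic; the function name is illustrative:

```python
def trigger_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# 0/10 positives triggered; 10/10 negatives correctly did not trigger.
sensitivity, specificity, accuracy = trigger_metrics(tp=0, fn=10, tn=10, fp=0)
print(sensitivity, specificity, accuracy)  # 0.0 1.0 0.5
```

An accuracy of 0.5 looks like a middling score, but it comes entirely from the negatives. A skill that never fires would earn the same number.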
Slide 4 / 5 · Iter 1 vs Iter 5
A better description didn't move the score
ITER 1 — ORIGINAL
Point at any code repo. Produces a structured briefing — architecture, entry points, data flow, key dependencies, test patterns. Use this whenever the user asks to understand, summarize, onboard to, or get a briefing on a codebase, even if they don't say "inspect."
ITER 5 — REWRITTEN
Invoke for any request where the user wants to understand how an existing codebase works — not change it, not compare frameworks, but map it. Classic triggers: lost in an inherited repo, exploring a client's legacy app, onboarding to an unfamiliar service.
The optimizer wrote a better description. The score didn't move. The description isn't the bottleneck.
Slide 5 / 5 · What this tells us
Optimizers can only optimize what they touch
The description was the lever. It got pulled. The score didn't budge.
Bottleneck likely lives elsewhere. Candidates: trigger detection, eval harness, model config
That's the lesson the optimizer didn't intend to teach but did anyway
2.3.4
CoT for tools · Block 1 callback
3 min · Lecture · Doctrine close
Slide 1 / 3 · Block 1 callback
CoT — why it works
Don't trust the model to invent the right thinking steps
You tell it: "Think step by step. First X, then Y, then Z."
You write the structure. The model fills it in.
Slide 2 / 3 · Same principle, one altitude up
Tool-CoT — when triggering hits its ceiling
Don't trust the model to invent the right tool either
Tell it: "Use the project-inspector skill on this repo."
Auto-trigger is convenient. Explicit invocation is reliable. In production, you do both.
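One way to "do both" is to prefer a skill the user names explicitly and fall back to whatever auto-trigger selected. A minimal sketch; resolve_skill and explicit_invocation are hypothetical helpers for illustration, not part of any real API.

```python
from typing import Optional

def resolve_skill(explicit: Optional[str], auto_detected: Optional[str]) -> Optional[str]:
    """Prefer the explicitly named skill; otherwise use the auto-trigger result."""
    return explicit if explicit else auto_detected

def explicit_invocation(skill: str, target: str) -> str:
    """Build the kind of direct instruction shown on the slide."""
    return f"Use the {skill} skill on {target}."

print(resolve_skill("project-inspector", None))               # project-inspector
print(resolve_skill(None, None))                              # None
print(explicit_invocation("project-inspector", "this repo"))  # Use the project-inspector skill on this repo.
```

The fallback order matters: when the description fails to trigger, as it did in the plateau run, the explicit path still reaches the skill.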
Slide 3 / 3 · Two patterns. You already use one.
Direct what the AI writes — direct which tool it uses
CoT · Explicit invocation
The plateau is honest data. Description engineering has limits. Knowing where they are before you ship is worth more than another ten percent on a triggering benchmark.