CORE_WORKSHOP_v1.0
BLOCK 2 · BUILD & EVAL A SKILL 11:40 – 12:03 · 23 min · 12L / 11HO

TOPIC 2.3 / 3

Optimizer & Ceiling

Auto-iterate the description field and score every candidate on a held-out test set. Read what the optimizer's data tells us: gap, plateau, or otherwise. Pull a Block 1 callback (CoT) up to the tool altitude. Knowing the ceiling beats chasing another ten percent.

2.3.1

Launch the optimizer

5 min Hands-on
Slide 1 / 2 · What it does

Auto-iterate the description field

  • Body stays fixed; only description changes
  • Scored against trigger-eval.json — 20 queries
  • 10 should-trigger, 10 should-not-trigger (near-misses on both sides)
  • 5 iterations, 60/40 train/test split
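
A 60/40 split over 20 queries works out to 12 train and 8 test. The sketch below shows one way such a split could be made, assuming it is stratified so each side keeps an even mix of should-trigger and should-not-trigger queries. The `should_trigger` field name and the query records are hypothetical stand-ins; the real schema of trigger-eval.json and the plugin's actual split logic may differ.

```python
import random

def stratified_split(queries, train_fraction=0.6, seed=0):
    """Split queries into train/test, keeping the positive/negative mix in each part.

    `should_trigger` is a hypothetical field name, not the plugin's confirmed schema.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Illustrative stand-in for the 20-query eval set: 10 positives, 10 negatives.
queries = [{"query": f"q{i}", "should_trigger": i < 10} for i in range(20)]
train, test = stratified_split(queries)
print(len(train), len(test))  # 12 8
```

With this stratification each side is half positives and half negatives: 6 + 6 in train, 4 + 4 in test.
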
Slide 2 / 2 · Launch

Launch it now

python -m scripts.run_loop \
  --eval-set evals/trigger-eval.json \
  --skill-path skills/project-inspector \
  --model claude-sonnet-4-6 \
  --max-iterations 5 \
  --verbose

Runs ~8–10 min in the background. We'll narrate while it runs.

2.3.2

Narration while it runs

5 min Lecture
Slide 1 / 3 · The loop

What the optimizer is doing right now

current description
    ↓ evaluate on 12 train + 8 test queries (3 runs each)
trigger rates + failures
    ↓ ask Claude to propose improved description (failures as context)
new description
    ↓ evaluate on same 12 + 8 (same 3-run protocol)
    ↓ repeat until 5 iterations are scored
pick description with best TEST score
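
The loop above can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: `evaluate` and `propose` are hypothetical callables standing in for the scoring harness and the Claude rewrite step. The tie-breaking rule (an earlier candidate keeps its place on a tie) is an assumption chosen to match "ties go to original."

```python
def run_loop(original, evaluate, propose, max_iterations=5):
    """Iterate on a description and keep the candidate with the best test score.

    evaluate(description) -> (train_score, test_score, failures)   # hypothetical
    propose(description, failures) -> new description              # hypothetical
    """
    history = []
    best = None
    current = original
    for iteration in range(1, max_iterations + 1):
        train_score, test_score, failures = evaluate(current)
        row = {"iteration": iteration, "description": current,
               "train": train_score, "test": test_score}
        history.append(row)
        # Strict ">" means a tie never displaces an earlier candidate,
        # so the original description wins when every score is equal.
        if best is None or test_score > best["test"]:
            best = row
        if iteration < max_iterations:
            current = propose(current, failures)
    return best, history

# Stub evaluator that scores every description the same: 6/12 train, 4/8 test.
def flat_evaluate(description):
    return 6, 4, ["positive query did not trigger"]

def stub_propose(description, failures):
    return description + " (rewritten)"

best, history = run_loop("original description", flat_evaluate, stub_propose)
print(best["iteration"], len(history))  # 1 5
```

When every candidate ties, the sketch returns iteration 1, which is the behavior to look for in the report.
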
Slide 2 / 3 · Overfit

Why select on test, not train

  • Train score = how well you did on queries you've already seen
  • Test score = how well you'll do on queries you haven't
  • Select on train → great on your 12 queries, bad on everything else
  • Select on test → generalizes
  • Same idea as ML train/test splits. Smaller scale.
Slide 3 / 3 · Production stakes

Triggering errors are the #1 production failure mode

  • A skill that never triggers when it should = doesn't exist
  • A skill that triggers on the wrong things = worse (pollutes other conversations)
  • Description engineering is its own discipline
  • The plugin does it for you in 10 minutes
2.3.3

Read the plateau

6 min Hands-on

Reframed beat (Path B, post-dry-run 2026-05-18). The optimizer plateaued — 5 iterations all tied at 6/12 train · 4/8 test · 0/10 positive triggers. Story shifted from "see the train-test gap" to "see what the optimizer can't reach."

Slide 1 / 5 · Reading the report

Five rows, one per iteration

  • Proposed description (full text)
  • Train score (12 queries × 3 runs = 36 trigger decisions)
  • Test score (8 queries × 3 runs = 24 trigger decisions)
  • The gap between train and test — if there is one
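
The report counts queries (12 and 8) while the harness records individual trigger decisions (36 and 24). The source does not say how three runs collapse into one pass or fail per query, so the sketch below assumes a simple rule: a query passes when its trigger rate falls on the correct side of a 0.5 threshold. Both the threshold and the data layout are assumptions made for illustration.

```python
def score_queries(runs, threshold=0.5):
    """runs: list of (should_trigger, [bool, bool, bool]) pairs, one per query.

    A query counts as passed when its trigger rate lands on the right side of
    `threshold`. The 0.5 threshold is an assumption made for this sketch.
    """
    passed = 0
    decisions = 0
    for should_trigger, outcomes in runs:
        decisions += len(outcomes)
        rate = sum(outcomes) / len(outcomes)
        triggered = rate >= threshold
        if triggered == should_trigger:
            passed += 1
    return passed, len(runs), decisions

# 12 train queries (6 positive, 6 negative), 3 runs each; nothing ever triggers.
train_runs = [(i < 6, [False, False, False]) for i in range(12)]
print(score_queries(train_runs))  # (6, 12, 36)
```
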
Slide 2 / 5 · The plateau

Five iterations. Same score.

6/12 train · 4/8 test

Every iteration. No gap. No winner. Optimizer kept iter 1 (ties go to original).

The story I expected isn't the story this report has.

Slide 3 / 5 · Where the score is hiding

Zero sensitivity. One hundred percent specificity.

  • 0/10 positive queries triggered — across all 5 iterations
  • 10/10 negative queries correctly didn't trigger
  • The skill is dormant on real-world language
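
The counts above map directly onto the standard confusion-matrix definitions. A short worked example, using only the numbers from the report, shows why a dormant skill still scores 50 percent:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of should-trigger queries that triggered
    specificity = tn / (tn + fp)   # share of should-not-trigger queries that stayed quiet
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Counts from the report: 0 of 10 positives triggered, 10 of 10 negatives stayed quiet.
print(sensitivity_specificity(tp=0, fn=10, tn=10, fp=0))  # (0.0, 1.0, 0.5)
```

An accuracy of 0.5 is the same ratio as 6/12 train and 4/8 test. The headline score looks mediocre rather than broken because the negatives carry all of it.
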
Slide 4 / 5 · Iter 1 vs Iter 5

A better description didn't move the score

ITER 1 — ORIGINAL

Point at any code repo. Produces a structured briefing — architecture, entry points, data flow, key dependencies, test patterns. Use this whenever the user asks to understand, summarize, onboard to, or get a briefing on a codebase, even if they don't say "inspect."

ITER 5 — REWRITTEN

Invoke for any request where the user wants to understand how an existing codebase works — not change it, not compare frameworks, but map it. Classic triggers: lost in an inherited repo, exploring a client's legacy app, onboarding to an unfamiliar service.

The optimizer wrote a better description. The score didn't move. The description isn't the bottleneck.

Slide 5 / 5 · What this tells us

Optimizers can only optimize what they touch

  • The description was the lever. It got pulled. The score didn't budge.
  • Bottleneck lives elsewhere: trigger detection, eval harness, model config
  • That's the lesson the optimizer didn't intend to teach but did anyway
2.3.4

CoT for tools · Block 1 callback

3 min Lecture · Doctrine close
Slide 1 / 3 · Block 1 callback

CoT — why it works

  • Don't trust the model to invent the right thinking steps
  • You tell it: "Think step by step. First X, then Y, then Z."
  • You write the structure. The model fills it in.
Slide 2 / 3 · Same principle, one altitude up

Tool-CoT — when triggering hits its ceiling

  • Don't trust the model to invent the right tool either
  • Tell it: "Use the project-inspector skill on this repo."
  • Auto-trigger is convenient. Explicit invocation is reliable. In production, you do both.
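
"You do both" can be read as a simple precedence rule: an explicit request wins, and auto-triggering is the fallback. The sketch below illustrates that reading only. `resolve_skill` and the `auto_trigger` callable are hypothetical names, not part of any real skill runtime.

```python
def resolve_skill(message, explicit_skill=None, auto_trigger=None):
    """Pick a skill: an explicit request wins, auto-triggering is the fallback.

    `auto_trigger` is a hypothetical callable returning a skill name or None.
    """
    if explicit_skill:
        return explicit_skill, "explicit"
    if auto_trigger is not None:
        guess = auto_trigger(message)
        if guess:
            return guess, "auto"
    return None, "none"

# A dormant trigger, like the one in the plateau report, never fires on its own.
dormant = lambda message: None

print(resolve_skill("help me understand this repo", auto_trigger=dormant))
# (None, 'none')
print(resolve_skill("help me understand this repo",
                    explicit_skill="project-inspector", auto_trigger=dormant))
# ('project-inspector', 'explicit')
```
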
Slide 3 / 3 · Two patterns. You already use one.

Direct what the AI writes — direct which tool it uses

CoT · Explicit invocation

The plateau is honest data. Description engineering has limits. Knowing where they are before you ship is worth more than another ten percent on a triggering benchmark.

2.3.5

Recap

2 min Lecture
Slide 1 / 2 · The journey

One artifact, seven stages

  1. v0 prompt ("summarize this codebase") — Block 1.2
  2. v5 prompt (AI-as-Coach, 5-component rubric) — Block 1.2
  3. v5 → working SKILL.md — Beat A
  4. Benchmarked vs baseline — Beats C–E
  5. Eval suite itself evaluated; real skill bug + 3 non-discriminating assertions — Beat F
  6. Coached the skill — Beat G
  7. Ran the optimizer · learned its ceiling — Topic 2.3
Slide 2 / 2 · Four lessons

Four lessons in seventy-five minutes

  • Does the output work?
  • Do the tests test anything?
  • Does the description help — and where does it stop?
  • When triggering hits its ceiling, do you invoke explicitly?

One artifact. Four altitudes. That's the discipline.

2.3.6

Bridge to Sam

2 min Lecture
Slide 1 / 2 · After lunch

Block 3 — Sam takes the stage

  • Vibe coding → agentic engineering
  • How working engineers compose skills like this every day
  • Superpowers, beads, hooks, multi-CLI verification, token discipline
  • The discipline you just learned doesn't go away — the workflow around it deepens
Slide 2 / 2 · Lunch

See you at one

"The plugin you used today is the training-wheels version of what Sam does without thinking."

Lunch · 12:00 – 1:00 · See you at 1 PM