BLOCK 2 · BUILD & EVAL A SKILL
10:45 – 11:00 · 15 min · 12L / 3HO
TOPIC 2.1 / 3
Orientation
Skill-creator orientation and the "evaluate the eval" doctrine. Level-set the workflow,
name the three eval pathologies, verify everyone's environment before we drive into the lab.
Slide 1 / 3 · Title
Block 2 — Build & Eval a Skill
Evaluate the eval.
75 min · ~64% hands-on
Slide 2 / 3 · Premise
What this block does
Turn your v5 prompt into a working skill on disk
Run with-skill vs without-skill in parallel
Watch the analyzer catch a test that doesn't test anything
Optimize the description on a held-out test set
Slide 3 / 3 · Workflow
The skill-creator loop
intent → SKILL.md draft → test prompts
↓
spawn parallel runs (with-skill + baseline)
↓
grade + benchmark → analyzer pass → view
↓
improve → next iteration
↓
description optimizer (5 iter, 60/40 train/test)
Speaker notes
Open hard. "In Block 1 you wrote a pretty good prompt. Now we're going to find out how good. And — more importantly — whether your test for 'good' is actually testing anything."
Then set the stake: "Most people in this room who've written prompt evals have written at least one that doesn't measure what they thought it measured. By the end of this block, you'll know how to spot one. The plugin we're about to use does the spotting for you — but only if you know what it's pointing at."
Walk the diagram once, top to bottom, naming each step. Don't explain yet — just name. "That's one iteration. Most teams call this 'evals.' I'm going to insist on calling it 'evaluating the evals' — because that analyzer pass is where the real learning happens."
Demo piece
None. Pure framing — set the room, then transition into the workflow lecture.
Slide 1 / 4 · Parallel
Why parallel runs?
Same prompt, same model, same temperature window
One subagent has the skill loaded; one doesn't
The skill is the only difference
Sequential runs drift; parallel runs match
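The matched-pair idea on this slide can be sketched as follows. This is a minimal illustration, not the plugin's implementation: `RunConfig`, `run_case`, and `run_pair` are hypothetical names, the model name is a placeholder, and `run_case` is a stub standing in for whatever actually launches a subagent. The point is that the baseline config is derived from the with-skill config, so the skill is the only field that differs, and both arms are submitted together.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Inputs for one arm of a paired run (hypothetical shape)."""
    prompt: str
    model: str
    skill_path: Optional[str]  # None means baseline: no skill loaded


def run_case(config: RunConfig) -> dict:
    """Stub runner. A real runner would launch a subagent here."""
    arm = "with_skill" if config.skill_path else "without_skill"
    return {"arm": arm, "prompt": config.prompt, "model": config.model}


def run_pair(with_skill: RunConfig) -> dict:
    """Run both arms concurrently; the baseline differs only in skill_path."""
    baseline = replace(with_skill, skill_path=None)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_case, cfg) for cfg in (with_skill, baseline)]
        results = [future.result() for future in futures]
    return {result["arm"]: result for result in results}


pair = run_pair(
    RunConfig(
        prompt="Inspect this repo",
        model="example-model",  # placeholder, not a real model name
        skill_path="skills/project-inspector/SKILL.md",
    )
)
print(sorted(pair))
```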
Slide 2 / 4 · Baseline
The baseline is the discriminator
If the baseline passes your test, your skill isn't earning its keep
That's not a passing test — it's a non-discriminating assertion
Plugin flags assertions where with-skill AND without-skill both pass
First-class concern, not an afterthought
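The flag described on this slide comes down to a simple check. A minimal sketch, assuming each assertion's outcome is available as a pair of booleans, one per arm; `AssertionResult`, `find_non_discriminating`, and the sample assertion names are hypothetical and do not reflect the plugin's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssertionResult:
    """One assertion's outcome in both arms of a paired run (hypothetical shape)."""
    name: str
    passed_with_skill: bool
    passed_without_skill: bool


def find_non_discriminating(results: list[AssertionResult]) -> list[str]:
    """Return names of assertions that pass in BOTH arms.

    An assertion the baseline also passes cannot tell the skill
    apart from the model's default behavior.
    """
    return [
        result.name
        for result in results
        if result.passed_with_skill and result.passed_without_skill
    ]


# Illustrative data only; not taken from the workshop's eval set.
sample = [
    AssertionResult("mentions_framework", True, True),
    AssertionResult("lists_entry_points", True, False),
    AssertionResult("reports_test_command", False, False),
]
flagged = find_non_discriminating(sample)
print(flagged)
```

Only the first sample assertion is flagged: the second discriminates (skill passes, baseline fails), and the third fails in both arms, which is a different problem.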
Slide 3 / 4 · Grader
Separation of concerns inside the loop
The grader is a different subagent from the producer
Letting the producer grade itself = AI tar pit ("yes, I did the thing")
Grader reads assertions → {passed, evidence}
Evidence text is required — not just a thumbs up
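The "evidence is required" rule can be enforced at the point where a grader result is read. A minimal sketch, assuming the grader returns a mapping with `passed` and `evidence` keys as the slide suggests; `Grade` and `parse_grade` are hypothetical names, and the real plugin's schema may carry more fields.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Grade:
    """A grader verdict for one assertion: a boolean plus the evidence behind it."""
    passed: bool
    evidence: str


def parse_grade(raw: Mapping[str, Any]) -> Grade:
    """Validate a raw grader result; reject verdicts with no evidence text."""
    passed = raw.get("passed")
    evidence = raw.get("evidence")
    if not isinstance(passed, bool):
        raise ValueError("'passed' must be a boolean")
    if not isinstance(evidence, str) or not evidence.strip():
        raise ValueError("'evidence' must be non-empty text")
    return Grade(passed=passed, evidence=evidence.strip())


# Illustrative inputs only.
ok = parse_grade({"passed": True, "evidence": "Output names the config file it read."})

try:
    parse_grade({"passed": True, "evidence": ""})
    rejected = False
except ValueError:
    rejected = True

print(ok.passed, rejected)
```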
Slide 4 / 4 · Optimizer
Last step: optimize the trigger
5 auto-iterations on the description field (body unchanged)
20-query trigger eval, split 60/40 train/test
Picks the description with the best test score
Catches overfit before production does
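The split-and-select step can be sketched as follows. This is an illustration under stated assumptions, not the optimizer's code: `Candidate`, `split_queries`, and `pick_best` are hypothetical names, the shuffle seed is arbitrary, and the scores are made up. With 20 queries, a 60/40 split gives 12 for training and 8 held out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate description with its trigger accuracy on each split."""
    description: str
    train_score: float
    test_score: float

    @property
    def gap(self) -> float:
        """Train minus test; a large positive gap suggests overfitting."""
        return self.train_score - self.test_score


def split_queries(
    queries: list[str], train_fraction: float = 0.6, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Shuffle deterministically, then cut into train and held-out test sets."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Select on the held-out score, never on the training score."""
    return max(candidates, key=lambda candidate: candidate.test_score)


queries = [f"query-{i}" for i in range(20)]
train, test = split_queries(queries)

# Made-up scores: A looks better on train but generalizes worse.
candidates = [
    Candidate("description A", train_score=0.92, test_score=0.62),
    Candidate("description B", train_score=0.83, test_score=0.75),
]
best = pick_best(candidates)
print(len(train), len(test), best.description)
```

Selecting on the training score would have picked description A; its larger train-versus-test gap is the overfit signal.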
Speaker notes
This is your "design choices of the loop" lecture. Four slides, four design choices, ~1 minute each.
Parallel: the point is model-state drift. Sequential runs happen at different times and can see slightly different model states; launching both arms together keeps conditions matched.

Baseline: "If the without-skill run passes your test, your skill isn't earning its keep. That's the seed of evaluate-the-eval."
Grader separation: "Letting the model that produced the output also grade it is the classic AI tar pit. Separation of concerns matters even inside an eval loop." (Memorized line.)
Description optimizer: "Triggering accuracy is the number-one production failure mode for skills. The plugin does this in ten minutes."
Slide 1 / 2 · Pathologies
Three ways evals lie to you
Pathology | What it looks like | Where it surfaces
Non-discriminating | Passes for BOTH with-skill and baseline | Analyzer pass (explicit flag)
Flaky / high-variance | Same input, different result on re-runs | benchmark.json mean ± stddev
Overfit | Passes on your fixture, fails everywhere else | Optimizer's 60/40 train/test gap (if optimization is gaining traction)
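For the flaky row, the mean ± stddev reading can be sketched as follows. This is a minimal illustration: `summarize`, `is_flaky`, and the 0.15 threshold are hypothetical, the pass rates are made up, and the sample standard deviation is an assumption, since the source does not say which formula benchmark.json uses.

```python
from statistics import mean, stdev


def summarize(pass_rates: list[float]) -> tuple[float, float]:
    """Return (mean, sample stddev) of one assertion's pass rate across re-runs."""
    if len(pass_rates) < 2:
        raise ValueError("need at least two re-runs to estimate variance")
    return mean(pass_rates), stdev(pass_rates)


def is_flaky(pass_rates: list[float], max_stddev: float = 0.15) -> bool:
    """Flag an assertion whose results vary too much to trust.

    The threshold is an illustrative assumption, not a plugin default.
    """
    _, spread = summarize(pass_rates)
    return spread > max_stddev


# Made-up re-run results for two assertions on the same input.
stable = [1.0, 1.0, 1.0, 1.0]
noisy = [1.0, 0.0, 1.0, 0.0]

print(is_flaky(stable), is_flaky(noisy))
```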
Slide 2 / 2 · Foreshadow
Today we will see…
Non-discriminating — planted in your eval set. Watch for it.
Flaky — described, not planted (too noisy to teach with)
Overfit — may surface in the optimizer report. If it doesn't, the lesson at the end of the block is different.
Speaker notes
Show the table. Walk the three rows.
Row 1: "Non-discriminating — passes for both your skill and the baseline. Means your test doesn't measure what your skill does — it just measures what any model tends to do. We've planted one of these in your eval set."
Row 2: "Flaky — same input, different result across re-runs. Usually shows up in LLM-as-judge assertions where the rubric is too vague. The benchmark shows mean ± stddev. High variance means you can't trust the eval. We didn't plant a flaky one because they're noisy to teach with — but you'll meet them in production."
Row 3: "Overfit — your eval passes on the fixture you wrote it for, fails everywhere else. Where it would surface: the optimizer's train-vs-test gap in Topic 2.3 — if the optimizer is gaining traction. If scores plateau instead, the lesson at the end of the block is different. Either way, you'll learn something about the ceiling on description-only optimization."
Verbal anchor on the second slide: "Three pathologies. Each surfaces in a specific place. By the time we're done, you'll be able to walk the loop on your own."
Slide 1 / 2 · Workspace
What appears on disk
project-inspector/
├── SKILL.md
├── evals/
│   └── evals.json
└── project-inspector-workspace/
    └── iteration-1/
        ├── eval-0-small-nuxt-repo/
        │   ├── with_skill/      (outputs + timing + grading)
        │   └── without_skill/   (outputs + timing + grading)
        ├── eval-1-… eval-4-…
        ├── benchmark.json
        └── benchmark.md
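The layout above can be checked mechanically once an iteration has run. A minimal sketch, assuming only the directory and file names shown in the tree; `missing_entries` is a hypothetical helper, and the demo builds a deliberately incomplete tree in a temporary directory rather than reading a real workspace.

```python
import tempfile
from pathlib import Path

ARMS = ("with_skill", "without_skill")
ITERATION_FILES = ("benchmark.json", "benchmark.md")


def missing_entries(iteration_dir: Path) -> list[str]:
    """List expected files and arm directories that are absent from one iteration."""
    missing = [
        name for name in ITERATION_FILES if not (iteration_dir / name).is_file()
    ]
    for case_dir in sorted(iteration_dir.glob("eval-*")):
        if not case_dir.is_dir():
            continue
        for arm in ARMS:
            if not (case_dir / arm).is_dir():
                missing.append(f"{case_dir.name}/{arm}")
    return missing


# Demo on a throwaway tree that lacks benchmark.md and one arm.
with tempfile.TemporaryDirectory() as tmp:
    iteration = Path(tmp) / "iteration-1"
    (iteration / "eval-0-small-nuxt-repo" / "with_skill").mkdir(parents=True)
    (iteration / "benchmark.json").write_text("{}")
    report = missing_entries(iteration)

print(report)
```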
Slide 2 / 2 · Setup check (HO)
Before we drive — three checks
cd ai-engineering-workshop
skills/project-inspector/SKILL.md exists as a stub
evals/evals.json exists with 5 cases
Verify skill-creator is installed (one-liner below)
Verify command:
grep -q '"skill-creator@' ~/.claude/plugins/installed_plugins.json \
&& echo "✓ skill-creator installed" \
|| echo "✗ NOT installed"
If the red X appears: launch claude → type /plugin → search "skill-creator" → install.
Speaker notes
Slide 1: one minute on the tree. "Skill goes in its own folder. Eval suite goes in evals/evals.json. Plugin creates project-inspector-workspace/iteration-1/ as a sibling: one directory per case, each holding a with_skill/ and a without_skill/ subdirectory, plus benchmark.json and benchmark.md at the iteration level. You'll see all this on disk in about ten minutes."
Slide 2 setup beat: pure environment verification, no judgment. "Everyone open the starter repo. Should see the stub SKILL.md and the 5-case evals.json. Hand-raise — anyone missing either? Send stragglers to back-of-room helper."
Then plugin check: "You did this in Block 1.3. The one-liner on screen reads the plugins config on disk and tells you yes or no. If you get the red X — launch claude, type /plugin, install skill-creator from the marketplace, rejoin. Pause."
Close: "Good. Now we drive."
Demo piece — three commands attendees run
Attendees run these three commands. Instructor mirrors on the projector.
cd ~/code/ai-engineering-workshop
ls skills/project-inspector/SKILL.md evals/evals.json
grep -q '"skill-creator@' ~/.claude/plugins/installed_plugins.json \
&& echo "✓ skill-creator installed" \
|| echo "✗ NOT installed"
If any attendee gets the red X: they install during Beat A's 3-min room loop. There's no break between 2.1 and 2.2 — fix in-band.
Why this check (not claude /plugin list): /plugin is an interactive REPL command, not a CLI subcommand. You can't pipe its output to grep. The installed_plugins.json file is the source of truth — Claude Code writes to it whenever a plugin is installed or updated.