Block 2 / Topic 2.1 — Orientation

2.1.1

Opening hook

2 min Lecture

Slide 1 / 3 · Title

Block 2 — Build & Eval a Skill

Evaluate the eval.

75 min · ~64% hands-on

Slide 2 / 3 · Premise

What this block does

Turn your v5 prompt into a working skill on disk
Run with-skill vs without-skill in parallel
Watch the analyzer catch a test that doesn't test anything
Optimize the description on a held-out test set

Slide 3 / 3 · Workflow

The skill-creator loop

intent → SKILL.md draft → test prompts
            ↓
    spawn parallel runs (with-skill + baseline)
            ↓
    grade + benchmark  →  analyzer pass  →  view
            ↓
    improve  →  next iteration
            ↓
    description optimizer (5 iter, 60/40 train/test)

Speaker notes

Open hard. "In Block 1 you wrote a pretty good prompt. Now we're going to find out how good. And — more importantly — whether your test for 'good' is actually testing anything."

Then set the stake: "Most people in this room who've written prompt evals have written at least one that doesn't measure what they thought it measured. By the end of this block, you'll know how to spot one. The plugin we're about to use does the spotting for you — but only if you know what it's pointing at."

Walk the diagram once, top to bottom, naming each step. Don't explain yet — just name. "That's one iteration. Most teams call this 'evals.' I'm going to insist on calling it 'evaluating the evals' — because that analyzer pass is where the real learning happens."

2.1.2

The skill-creator workflow at a glance

4 min Lecture

Slide 1 / 4 · Parallel

Why parallel runs?

Same prompt, same model, same temperature window
One subagent has the skill loaded; one doesn't
The skill is the only difference
Sequential runs drift as conditions change between them; parallel runs share the same conditions

Slide 2 / 4 · Baseline

The baseline is the discriminator

If the baseline also passes your test, the test can't show that your skill is earning its keep
That's not a passing test — it's a non-discriminating assertion
Plugin flags assertions where with-skill AND without-skill both pass
First-class concern, not an afterthought
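
A minimal sketch of the check described above, assuming graded results are available as a mapping from assertion name to pass/fail for each variant. The `with_skill` / `without_skill` keys mirror the workspace folder names; the assertion names and the data shape are hypothetical, not the plugin's actual format.

```python
from typing import Dict, List


def non_discriminating(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return assertion names that pass both with the skill and without it."""
    return [
        name
        for name, outcome in results.items()
        if outcome["with_skill"] and outcome["without_skill"]
    ]


# Hypothetical graded results for one eval case.
example_results = {
    "mentions-framework": {"with_skill": True, "without_skill": True},
    "lists-entry-points": {"with_skill": True, "without_skill": False},
    "output-is-markdown": {"with_skill": False, "without_skill": False},
}

flagged = non_discriminating(example_results)
print(flagged)  # ['mentions-framework']
```

Only the first assertion is flagged: it passes either way, so it says nothing about the skill. The third fails for both variants, which is a different problem (the assertion or the skill is broken) and is deliberately not reported here.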

Slide 3 / 4 · Grader

Separation of concerns inside the loop

The grader is a different subagent from the producer
Letting the producer grade itself = AI tar pit ("yes, I did the thing")
Grader reads assertions → {passed, evidence}
Evidence text is required — not just a thumbs up
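
To make "evidence is required" concrete, here is a small sketch that validates one grader result. It assumes the result is a dict with the two fields named on the slide, `passed` and `evidence`; the function name and the error strings are illustrative only.

```python
from typing import Any, Dict, List


def grading_problems(result: Dict[str, Any]) -> List[str]:
    """List what is wrong with a single grader result (empty list means OK)."""
    problems = []
    if not isinstance(result.get("passed"), bool):
        problems.append("passed must be a boolean")
    evidence = result.get("evidence")
    if not isinstance(evidence, str) or not evidence.strip():
        problems.append("evidence must be non-empty text")
    return problems


ok = {"passed": True, "evidence": "Output names the framework in its first paragraph."}
thumbs_up_only = {"passed": True}

print(grading_problems(ok))              # []
print(grading_problems(thumbs_up_only))  # ['evidence must be non-empty text']
```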

Slide 4 / 4 · Optimizer

Last step: optimize the trigger

5 auto-iterations on the description field (body unchanged)
20-query trigger eval, split 60/40 train/test
Picks the description with the best test score
Catches overfit before production does
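
The arithmetic behind this slide can be sketched as follows: 20 queries split 60/40 gives 12 for training and 8 held out, and the winner is chosen on the held-out score. This is an illustration under assumptions, not the optimizer's code; the function names, the seeded shuffle, and the example scores are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_queries(
    queries: List[str], train_fraction: float = 0.6, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then cut into (train, test)."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(scores: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """scores maps description -> (train_score, test_score).

    Returns the description with the highest test score and its
    train-minus-test gap; a large positive gap suggests overfitting.
    """
    best = max(scores, key=lambda d: scores[d][1])
    train_score, test_score = scores[best]
    return best, train_score - test_score


train, test = split_queries([f"query-{i}" for i in range(20)])
print(len(train), len(test))  # 12 8

# Hypothetical scores: fraction of trigger queries handled correctly.
candidate_scores = {
    "description-a": (9 / 12, 5 / 8),
    "description-b": (12 / 12, 4 / 8),   # perfect on train, worst on test
    "description-c": (10 / 12, 6 / 8),
}
best, gap = pick_best(candidate_scores)
print(best)  # description-c
```

Note that `description-b` would win if the choice were made on train scores. Selecting on the held-out 8 is what keeps it out.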

2.1.3

The three eval pathologies

5 min Lecture

Slide 1 / 2 · Pathologies

Three ways evals lie to you

Pathology	What it looks like	Where it surfaces
Non-discriminating	Passes for BOTH with-skill and baseline	Analyzer pass — explicit flag
Flaky / high-variance	Same input, different result on re-runs	`benchmark.json` mean ± stddev
Overfit	Passes on your fixture, fails everywhere else	Optimizer's 60/40 train/test gap, visible only if the optimizer is actually improving train scores
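
For the flaky row, "mean ± stddev" can be sketched like this. The sketch assumes each re-run yields a pass rate between 0 and 1 and uses the sample standard deviation; whether `benchmark.json` computes it the same way is not stated here, and the 0.15 threshold is an arbitrary illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(pass_rates: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(pass_rates), stdev(pass_rates)


def is_flaky(pass_rates: List[float], max_stddev: float = 0.15) -> bool:
    """Flag an eval whose results vary too much across re-runs.

    max_stddev is a hypothetical threshold; pick one that suits your suite.
    """
    return stdev(pass_rates) > max_stddev


stable = [1.0, 1.0, 1.0]
noisy = [1.0, 0.0, 1.0, 0.0]

print(summarize(stable))                  # (1.0, 0.0)
print(is_flaky(stable), is_flaky(noisy))  # False True
```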

Slide 2 / 2 · Foreshadow

Today we will see…

Non-discriminating — planted in your eval set. Watch for it.
Flaky — described, not planted (too noisy to teach with)
Overfit — may surface in the optimizer report. If it doesn't, the lesson at the end of the block is different.

Speaker notes

Show the table. Walk the three rows.

Row 1: "Non-discriminating — passes for both your skill and the baseline. Means your test doesn't measure what your skill does — it just measures what any model tends to do. We've planted one of these in your eval set."

Row 2: "Flaky — same input, different result across re-runs. Usually shows up in LLM-as-judge assertions where the rubric is too vague. The benchmark shows mean ± stddev. High variance means you can't trust the eval. We didn't plant a flaky one because they're noisy to teach with — but you'll meet them in production."

Row 3: "Overfit — your eval passes on the fixture you wrote it for, fails everywhere else. Where it would surface: the optimizer's train-vs-test gap in Topic 2.3 — if the optimizer is gaining traction. If scores plateau instead, the lesson at the end of the block is different. Either way, you'll learn something about the ceiling on description-only optimization."

Verbal anchor on the second slide: "Three pathologies. Each surfaces in a specific place. By the time we're done, you'll be able to walk the loop on your own."

2.1.4

Workspace layout + hands-on setup

4 min (1 Lecture / 3 Hands-on)

Slide 1 / 2 · Workspace

What appears on disk

project-inspector/
├── SKILL.md
├── evals/
│   └── evals.json
└── project-inspector-workspace/
    └── iteration-1/
        ├── eval-0-small-nuxt-repo/
        │   ├── with_skill/   (outputs + timing + grading)
        │   └── without_skill/ (outputs + timing + grading)
        ├── eval-1-… eval-4-…
        ├── benchmark.json
        └── benchmark.md
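
A quick way to sanity-check this layout after a run is to confirm that every eval directory has both variants. The sketch below assumes the directory names shown in the tree (`eval-…`, `with_skill`, `without_skill`); the helper name and the `eval-1-demo` case are hypothetical, and the demo builds a throwaway tree in a temp directory rather than touching a real workspace.

```python
import tempfile
from pathlib import Path
from typing import List


def incomplete_evals(iteration_dir: Path) -> List[str]:
    """Return 'eval-dir/variant' entries whose variant folder is missing."""
    missing = []
    eval_dirs = sorted(
        p for p in iteration_dir.iterdir()
        if p.is_dir() and p.name.startswith("eval-")
    )
    for eval_dir in eval_dirs:
        for variant in ("with_skill", "without_skill"):
            if not (eval_dir / variant).is_dir():
                missing.append(f"{eval_dir.name}/{variant}")
    return missing


# Demo: one complete eval and one with the baseline run missing.
with tempfile.TemporaryDirectory() as tmp:
    iteration = Path(tmp) / "iteration-1"
    (iteration / "eval-0-small-nuxt-repo" / "with_skill").mkdir(parents=True)
    (iteration / "eval-0-small-nuxt-repo" / "without_skill").mkdir()
    (iteration / "eval-1-demo" / "with_skill").mkdir(parents=True)
    demo_missing = incomplete_evals(iteration)

print(demo_missing)  # ['eval-1-demo/without_skill']
```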

Slide 2 / 2 · Setup check (HO)

Before we drive — three checks

cd ai-engineering-workshop
skills/project-inspector/SKILL.md exists as a stub
evals/evals.json exists with 5 cases
Verify skill-creator is installed (one-liner below)

Verify command:

grep -q '"skill-creator@' ~/.claude/plugins/installed_plugins.json \
  && echo "✓ skill-creator installed" \
  || echo "✗ NOT installed"

If the red X appears: launch claude → type /plugin → search "skill-creator" → install.

Speaker notes

Slide 1: one minute on the tree. "Skill goes in its own folder. Eval suite goes in evals/evals.json. Plugin creates project-inspector-workspace/iteration-1/ as a sibling — one directory per case, one for with-skill, one for without-skill, plus a benchmark.json at the iteration level. You'll see all this on disk in about ten minutes."

Slide 2 setup beat: pure environment verification, no judgment. "Everyone open the starter repo. Should see the stub SKILL.md and the 5-case evals.json. Hand-raise — anyone missing either? Send stragglers to back-of-room helper."

Then plugin check: "You did this in Block 1.3. The one-liner on screen reads the plugins config on disk and tells you yes or no. If you get the red X — launch claude, type /plugin, install skill-creator from the marketplace, rejoin. Pause."

Close: "Good. Now we drive."

Demo piece — three commands attendees run

Attendees run these three commands. Instructor mirrors on the projector.

cd ~/code/ai-engineering-workshop
ls skills/project-inspector/SKILL.md evals/evals.json
grep -q '"skill-creator@' ~/.claude/plugins/installed_plugins.json \
  && echo "✓ skill-creator installed" \
  || echo "✗ NOT installed"

If any attendee gets the red X: they install during Beat A's 3-min room loop. There's no break between 2.1 and 2.2 — fix in-band.

Why this check (not claude /plugin list): /plugin is an interactive REPL command, not a CLI subcommand. You can't pipe its output to grep. The installed_plugins.json file is the source of truth — Claude Code writes to it whenever a plugin is installed or updated.
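
For anyone who prefers a script to the one-liner, the same check can be sketched in Python. It does exactly what the `grep` does, a substring search for `"skill-creator@`, and makes no assumption about the JSON structure of the file. The one addition is that a missing file is reported separately from a missing plugin; the function names are illustrative.

```python
from pathlib import Path

MARKER = '"skill-creator@'  # same substring the grep one-liner looks for


def status_from_text(text: str) -> str:
    """Classify the config contents by substring, mirroring grep -q."""
    return "installed" if MARKER in text else "not installed"


def plugin_status(config_path: Path) -> str:
    """Check the plugins config on disk; distinguish a missing file."""
    if not config_path.is_file():
        return "config file not found"
    return status_from_text(config_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    default_path = Path.home() / ".claude" / "plugins" / "installed_plugins.json"
    print(plugin_status(default_path))
```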