CORE_WORKSHOP_v1.0
BLOCK 2 · BUILD & EVAL A SKILL 10:45 – 11:00 · 15 min · 12L / 3HO

TOPIC 2.1 / 3

Orientation

Skill-creator orientation and the "evaluate the eval" doctrine. Level-set the workflow, name the three eval pathologies, verify everyone's environment before we drive into the lab.

2.1.1

Opening hook

2 min Lecture
Slide 1 / 3 · Title

Block 2 — Build & Eval a Skill

Evaluate the eval.

75 min · ~64% hands-on

Slide 2 / 3 · Premise

What this block does

  • Turn your v5 prompt into a working skill on disk
  • Run with-skill vs without-skill in parallel
  • Watch the analyzer catch a test that doesn't test anything
  • Optimize the description, scored against a held-out test set
Slide 3 / 3 · Workflow

The skill-creator loop

intent → SKILL.md draft → test prompts
            ↓
    spawn parallel runs (with-skill + baseline)
            ↓
    grade + benchmark  →  analyzer pass  →  view
            ↓
    improve  →  next iteration
            ↓
    description optimizer (5 iter, 60/40 train/test)
2.1.2

The skill-creator workflow at a glance

4 min Lecture
Slide 1 / 4 · Parallel

Why parallel runs?

  • Same prompt, same model, same temperature window
  • One subagent has the skill loaded; one doesn't
  • The skill is the only difference
  • Sequential runs drift; parallel runs match
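A minimal sketch of the paired launch described above. The function run_subagent is a hypothetical stand-in, not the plugin's API: a real run would call the model, while this one only echoes its configuration so the pairing logic is visible.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(prompt: str, skill_loaded: bool) -> dict:
    """Hypothetical stand-in for one subagent run (no model call is made)."""
    return {"prompt": prompt, "skill_loaded": skill_loaded}


def run_pair(prompt: str) -> dict:
    """Start the with-skill and baseline runs together on the same prompt."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        with_skill = pool.submit(run_subagent, prompt, True)
        baseline = pool.submit(run_subagent, prompt, False)
        return {
            "with_skill": with_skill.result(),
            "without_skill": baseline.result(),
        }


pair = run_pair("Summarize this repo's build setup")
print(pair["with_skill"]["skill_loaded"], pair["without_skill"]["skill_loaded"])
```

Both runs are submitted before either result is awaited, so they share the same conditions and the skill flag is the only input that differs.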
Slide 2 / 4 · Baseline

The baseline is the discriminator

  • If the baseline passes your test, your skill isn't earning its keep
  • That's not a passing test — it's a non-discriminating assertion
  • Plugin flags assertions where with-skill AND without-skill both pass
  • First-class concern, not an afterthought
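The check behind that flag can be sketched in a few lines. The assertion names and results below are made up for illustration, and the plugin's own data format may differ.

```python
def non_discriminating(with_skill: dict, without_skill: dict) -> list:
    """Return assertion names that pass both with the skill and in the baseline."""
    return sorted(
        name
        for name, passed in with_skill.items()
        if passed and without_skill.get(name, False)
    )


# Hypothetical assertion names and pass/fail results, for illustration only.
with_skill = {"mentions_framework": True, "lists_scripts": True, "output_is_nonempty": True}
without_skill = {"mentions_framework": False, "lists_scripts": False, "output_is_nonempty": True}

flagged = non_discriminating(with_skill, without_skill)
print(flagged)  # ['output_is_nonempty']
```

An assertion that lands in this list says nothing about the skill: the baseline already satisfies it.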
Slide 3 / 4 · Grader

Separation of concerns inside the loop

  • The grader is a different subagent from the producer
  • Letting the producer grade itself = AI tar pit ("yes, I did the thing")
  • Grader reads assertions → {passed, evidence}
  • Evidence text is required — not just a thumbs up
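One way to picture the "evidence is required" rule is a result record that refuses to exist without it. The class name GradeResult and its fields are assumptions for this sketch; only the passed/evidence pair comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical grader output for one assertion."""

    assertion: str
    passed: bool
    evidence: str

    def __post_init__(self) -> None:
        # A verdict without supporting text is rejected outright.
        if not self.evidence.strip():
            raise ValueError(f"assertion {self.assertion!r} was graded without evidence")


ok = GradeResult("lists_scripts", True, "Output names the dev, build, and lint scripts.")
print(ok.passed, ok.evidence)
```

Constructing GradeResult("lists_scripts", True, "") raises ValueError, which is the point: a bare thumbs up never reaches the benchmark.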
Slide 4 / 4 · Optimizer

Last step: optimize the trigger

  • 5 auto-iterations on the description field (body unchanged)
  • 20-query trigger eval, split 60/40 train/test
  • Picks the description with the best test score
  • Catches overfit before production does
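A rough sketch of the split-and-select step, assuming a simple shuffled split and made-up scores. The helper names, the seed, and the candidate records are illustrative and not taken from the optimizer itself.

```python
import random


def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle the queries, then cut them into train and held-out test sets."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(candidates):
    """Select the candidate description with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])


queries = [f"query-{i}" for i in range(20)]
train, test = split_queries(queries)
print(len(train), len(test))  # 12 8

# Hypothetical scores: v2 looks better on train but worse on the held-out set.
candidates = [
    {"description": "v1", "train_score": 0.75, "test_score": 0.75},
    {"description": "v2", "train_score": 1.00, "test_score": 0.50},
]
best = pick_best(candidates)
print(best["description"])  # v1
```

Selecting on the test score is what rejects v2 here: its gain exists only on the queries it was tuned against.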
2.1.3

The three eval pathologies

5 min Lecture
Slide 1 / 2 · Pathologies

Three ways evals lie to you

Pathology             | What it looks like                            | Where it surfaces
Non-discriminating    | Passes for BOTH with-skill and baseline       | Analyzer pass (explicit flag)
Flaky / high-variance | Same input, different result on re-runs       | benchmark.json mean ± stddev
Overfit               | Passes on your fixture, fails everywhere else | Optimizer's 60/40 train/test gap (if optimization is gaining traction)
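For the flaky row, the mean ± stddev reading can be sketched as follows. The input is a plain list of per-run pass rates, which is an assumption for illustration and not the actual benchmark.json schema. The 0.1 threshold is likewise arbitrary.

```python
from statistics import mean, stdev


def summarize(pass_rates):
    """Return (mean, sample standard deviation) across repeated runs."""
    return mean(pass_rates), stdev(pass_rates)  # stdev needs at least two runs


def is_flaky(pass_rates, max_stddev=0.1):
    """Flag an eval whose spread across re-runs exceeds a chosen threshold."""
    _, spread = summarize(pass_rates)
    return spread > max_stddev


# Hypothetical pass rates from four re-runs of the same input.
stable = [1.0, 1.0, 1.0, 1.0]
noisy = [1.0, 0.0, 1.0, 0.0]

print(is_flaky(stable), is_flaky(noisy))  # False True
```

A mean of 0.5 alone could describe a test that half-works every time. The spread is what shows it flipping between runs.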
Slide 2 / 2 · Foreshadow

Today we will see…

  • Non-discriminating: planted in your eval set. Watch for it.
  • Flaky: described, not planted (too noisy to teach with)
  • Overfit: may surface in the optimizer report. If it doesn't, the lesson at the end of the block is different.
2.1.4

Workspace layout + hands-on setup

4 min 1L / 3HO
Slide 1 / 2 · Workspace

What appears on disk

project-inspector/
├── SKILL.md
├── evals/
│   └── evals.json
└── project-inspector-workspace/
    └── iteration-1/
        ├── eval-0-small-nuxt-repo/
        │   ├── with_skill/   (outputs + timing + grading)
        │   └── without_skill/ (outputs + timing + grading)
        ├── eval-1-… eval-4-…
        ├── benchmark.json
        └── benchmark.md
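A small sketch for sanity-checking that layout after a run, assuming only the directory names shown in the tree. The helper name missing_run_dirs is hypothetical.

```python
from pathlib import Path


def missing_run_dirs(iteration_dir: Path) -> list:
    """List eval directories that lack a with_skill or without_skill side."""
    missing = []
    for eval_dir in sorted(iteration_dir.glob("eval-*")):
        if not eval_dir.is_dir():
            continue
        for side in ("with_skill", "without_skill"):
            if not (eval_dir / side).is_dir():
                missing.append(f"{eval_dir.name}/{side}")
    return missing
```

Calling missing_run_dirs(Path("project-inspector/project-inspector-workspace/iteration-1")) should return an empty list when every eval has both sides on disk. Anything it lists is a comparison with one half missing.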
Slide 2 / 2 · Setup check (HO)

Before we drive — three checks

  • cd ai-engineering-workshop
  • skills/project-inspector/SKILL.md exists as a stub
  • evals/evals.json exists with 5 cases
  • Verify skill-creator is installed (one-liner below)

Verify command:

grep -q '"skill-creator@' ~/.claude/plugins/installed_plugins.json \
  && echo "✓ skill-creator installed" \
  || echo "✗ NOT installed"

If "✗ NOT installed" prints: launch claude → type /plugin → search "skill-creator" → install.
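For anyone without grep on their machine, the same check can be approximated in Python. This sketch looks for the same marker string as the one-liner and treats a missing or unreadable file as "not installed". The function name plugin_installed is an assumption.

```python
from pathlib import Path


def plugin_installed(registry: Path, marker: str = '"skill-creator@') -> bool:
    """Return True if the registry file contains the plugin marker string."""
    try:
        return marker in registry.read_text(encoding="utf-8")
    except OSError:
        # Missing or unreadable file: report as not installed.
        return False
```

Usage: plugin_installed(Path.home() / ".claude" / "plugins" / "installed_plugins.json") returns True or False for the same file the grep command reads.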