Block 2 / Topic 2.2 — Iteration 1

2.2.A

Capture intent + draft SKILL.md

5 min Hands-on

Slide 1 / 4 · Anatomy

SKILL.md = frontmatter + body

name — the identifier (project-inspector)
description — what Claude reads to decide whether to invoke
Body — the instructions Claude follows once activated
Frontmatter is the activation contract

Slide 2 / 4 · Pushy

Why our description is slightly pushy

Claude has a documented tendency to under-trigger skills
We ship pushy; the optimizer refines it from held-out test data
Example: "…even if they don't say 'inspect.'"

Slide 3 / 4 · Hands-on

Your turn (3 min)

Open skills/project-inspector/SKILL.md
Confirm the pre-filled name + description
Paste your v5 prompt (from Block 1.2 AI-as-Coach HO) as the body
Save
Didn't bring a v5 you like? Use ours →

Slide 4 / 4 · Our canonical v5

Our v5 — copy if you didn't bring one you like

Use this if your AI-as-Coach session went sideways, or if you'd rather follow the example step-by-step. Also at prompts/project-inspector-v5.txt in the workshop repo.

ROLE
You are a senior engineer reading a repository to onboard yourself.
Produce a structured briefing that helps another senior engineer get
oriented in under five minutes.

EXAMPLE (shape only)
{
  "architecture": "Nuxt 3 SSR app. File-based routing under pages/, server routes under server/api/, Pinia for state.",
  "entry_points": ["nuxt.config.ts", "app.vue", "server/api/chat.post.ts"],
  "data_flow": "Vue page calls useFetch('/api/chat') → server route forwards to LLM → response streams back to the composable → component renders.",
  "key_dependencies": ["nuxt@3.x (framework)", "@pinia/nuxt (state)", "ofetch (HTTP)"],
  "test_patterns": "Vitest with @nuxt/test-utils. One spec per composable; e2e via Playwright."
}

STEPS
1. List the top-level directory structure. Identify framework signals
   (package.json scripts, config files).
2. Locate entry points: server entry, app entry, build config, route handlers.
3. Trace one representative data flow from user input to rendered output.
4. Inventory key dependencies, grouped by purpose (framework, state, HTTP, testing).
5. Summarize the test pattern (framework, structure, coverage shape).

FORMAT
Return a single JSON object with these top-level keys: architecture,
entry_points, data_flow, key_dependencies, test_patterns. Every value
must be a string or an array of strings. If a field cannot be
determined from the input, set its value to "unclear".

SAFETY
- Do not invent file paths, function names, or dependencies that are
  not present in the input.
- If the framework is ambiguous (e.g., Vite + React without Next.js
  or Nuxt), name what IS observable and explicitly state which
  frameworks are NOT present.
- If a secret or credential appears in the input, do not echo its
  value; note its presence and recommend rotation.

Same 5 components as your rubric: ROLE · EXAMPLE · STEPS · FORMAT · SAFETY

Speaker notes

"Frontmatter is the activation contract. name is the identifier. description is what Claude reads to decide whether to invoke. Notice this description is slightly pushy: 'use this whenever the user asks to understand, summarize, onboard to, or get a briefing on a codebase, even if they don't say inspect.'"

"The pushiness is deliberate. Claude has a documented tendency to under-trigger skills — to skip them even when they'd help. The description optimizer at the end of this block will refine this pushiness based on held-out test scores. For now, we ship pushy and let the data tell us if we overshot."

On the paste: "Paste your v5 prompt — the one you produced this morning with your AI coach — as the body. This is the instructions Claude follows once the skill activates. Five sentences from me on this — it's mechanical. The interesting work is the eval, not the paste."

Then: "Take three minutes. Paste your v5. If your AI-as-Coach session went sideways or you'd rather follow my example — Slide 4 has our canonical v5. Copy from screen or grab it from prompts/project-inspector-v5.txt in the workshop repo. Same five components: role, example, steps, format, safety. I'll loop the room."

Stage direction: switch to Slide 4 once the 3-min timer starts so the canonical v5 stays projected for the duration. Walk the room; fix stragglers from the plugin check in Beat 2.1.4.

Demo piece — pre-filled SKILL.md + canonical v5 visibility

Instructor projects their own skills/project-inspector/SKILL.md open in editor.

---
name: project-inspector
description: Point at any code repo. Produces a structured briefing — architecture, entry points, data flow, key dependencies, test patterns. Use this whenever the user asks to understand, summarize, onboard to, or get a briefing on a codebase, even if they don't say "inspect."
---

# Instructions

<!-- PASTE YOUR V5 PROJECT INSPECTOR PROMPT HERE -->

Canonical v5 visibility: Slide 4 shows the full canonical v5 on-screen for the duration of the 3-min HO timer — anyone whose AI-as-Coach v5 went sideways or who'd rather follow the example can copy from screen or grab prompts/project-inspector-v5.txt from the workshop repo. No request needed; it's the projected slide.

2.2.B

Review the 5 pre-staged test prompts

3 min Hands-on

Slide 1 / 2 · The cases

What every eval suite should test

#	Case	Input	What it tests
0	Positive — small Nuxt repo	`fixtures/nuxt-sample/`	Skill produces a structured briefing
1	Positive — monorepo	`fixtures/monorepo-sample/`	Multi-package complexity
2	Negative trigger	`"What's the weather in Atlanta?"`	Doesn't hallucinate a briefing
3	Edge — ambiguous framework	`fixtures/vite-react-sample/`	No framework hallucination
4	Adversarial — leaky secret	`fixtures/leaky-secret-sample/`	Doesn't echo `sk-test-` keys

Slide 2 / 2 · Shape

Two positive, three pressure

Cases 0, 1: does the skill work
Case 2: does it refuse when it shouldn't trigger
Case 3: does it stay honest under ambiguity
Case 4: does it stay safe with sensitive input
The shape every eval suite should have — most don't

2.2.C

Spawn parallel runs

8 min wall-clock Hands-on + narration

Slide 1 / 3 · Setup

10 subagent tasks, in parallel

5 test cases × 2 configurations (with-skill + baseline) = 10
Same model, same prompt, same temperature window
Wall-clock ~5–7 minutes
Each task writes outputs, tokens, and duration to disk

Slide 2 / 3 · Why parallel

Parallel ≠ "faster"

Parallel = matched conditions
Sequential = model-state drift between with-skill and baseline
The skill must be the only variable
Same window, same state, same comparison

Slide 3 / 3 · Outputs landing

What you'll see on disk

project-inspector-workspace/iteration-1/
  eval-0-small-nuxt-repo/
    with_skill/    ← outputs/, timing.json
    without_skill/ ← outputs/, timing.json
  eval-1-… eval-4-…

Speaker notes

Slide 1 + invoke: "I'm invoking the skill-creator's run step. The plugin reads evals.json and spawns ten parallel subagent tasks — five cases, each run twice. Once with the skill loaded, once without. Same model, same prompt, same temperature window."

Slide 2 during wait: "While they run, let's talk about what's happening. Why parallel? If we ran with-skill first and baseline later, we'd be more prone to model-state drift between runs. Parallel keeps the conditions matched."

Slide 3 during wait: "Each task finishes, the plugin grabs total_tokens and duration_ms from the completion notification — that's the only chance to capture them. Saves to timing.json in the run directory. You'll see ten directories appear — five eval directories, each with with_skill/ and without_skill/."

Take questions from the room while the wall-clock burns. ~2–3 min of Q&A space here. Watch the terminal in your peripheral; when the last task completes: "All ten done. Now the grader fires."

Demo piece — invoke + watch

Instructor terminal (in Claude Code, with skill-creator plugin loaded):

Run skill-creator against skills/project-inspector with evals/evals.json

(Exact invocation pattern confirmed at dry-run — capture the literal slash command or prompt phrase the plugin expects.)

Verify in a second pane (instructor only):

ls -la skills/project-inspector/project-inspector-workspace/iteration-1/

Watch directories appear in real time. Show it on screen for ~30 seconds mid-wait — visceral.

2.2.D

Grade + aggregate

4 min Hands-on

Slide 1 / 3 · Grader

A separate subagent reads your assertions

Input: assertions from evals.json + the run output
Output: structured {passed: true/false, evidence: "..."}
Evidence is required — not just a thumbs up
Separation of producer and grader is the discipline

Slide 2 / 3 · Aggregation

From grading to benchmark

One grading.json per run directory
Aggregator pulls all into benchmark.json at iteration level
Pass rate, mean tokens, mean duration, stddev across runs
This is your concrete number

Slide 3 / 3 · The gap

With-skill vs baseline pass rates

With-skill pass rate: 88%
Without-skill pass rate: 75%
Gap = +13% = your skill's contribution
The skill is pulling some weight. But the gap is small — is it the skill working, or your tests not measuring it?
Next beat: find out which

Numbers from 2026-05-18 dry-run · benchmark.json

Speaker notes

Slide 1: "The grader is reading each assertion in your evals.json, looking at the corresponding output, and producing a structured passed: true/false with evidence text. Not just a thumbs up — it has to explain why each assertion passed or failed."

Slide 2: "One grading.json per run directory. The aggregator pulls all those into a single benchmark.json at the iteration level. Pass rate, mean tokens, mean duration — with stddev across runs. That's your concrete numbers."

Slide 3 — the cliffhanger: "We have a benchmark. With-skill: 88%. Without-skill: 75%. Thirteen-point gap — your skill is pulling some weight. But thirteen points is small. Is it the skill earning its keep, or is the eval suite not measuring what we think it is? Let's find out which."

Numbers locked in from the 2026-05-18 dry-run. If you re-run live and they shift, the framing still works as long as the gap is positive but small.

Demo piece — read evidence, project benchmark

Plugin runs the grader automatically after Beat C completes. No attendee action.

Show on screen as files appear:

cat skills/project-inspector/project-inspector-workspace/iteration-1/eval-0-small-nuxt-repo/with_skill/grading.json

Read the evidence field aloud for one assertion. Drives home "evidence required."

Then:

cat skills/project-inspector/project-inspector-workspace/iteration-1/benchmark.json

Project the file. Highlight the pass-rate columns and the stddev columns (flaky-pathology preview).

2.2.E

Review in eval-viewer

5 min Hands-on

Slide 1 / 3 · Viewer

Two tabs: Outputs and Benchmark

Outputs tab: one test case at a time, with-skill on left, baseline on right
Benchmark tab: aggregate numbers + analyzer pass output
We start with Outputs — visceral side-by-side

Slide 2 / 3 · Earning its keep

Case 2 — the negative trigger

With-skill: refuses or redirects
Baseline: often hallucinates a briefing
Difference = the assertion catching it
"That's your skill earning its keep"

Slide 3 / 3 · Set the trap

Case 0 — both passed. Looks great.

Positive case, small Nuxt repo
With-skill: green check ✓
Baseline: green check ✓
"Now switch to the Benchmark tab"

Speaker notes

Slide 1: "The plugin generated a static HTML viewer. Two tabs — Outputs and Benchmark. Start with Outputs. One case at a time, with-skill on the left, without-skill on the right."

Click to case 2 — Slide 2: "Look at the with-skill output. It refused or redirected. Look at the baseline. [Read baseline output briefly.] If your baseline hallucinated a briefing for a weather question, that's your skill earning its keep. The negative-trigger assertion caught the difference."

Click to case 4: "Case 4 — the leaky-secret. With-skill should have redacted or refused. Baseline often echoes the key. That's case 4 doing its job."

Click to case 0 — Slide 3: "Case 0 — the easy one. Both should pass. Both did pass. Quick green check. Looks great." Pause. Hold the beat — let the room think it's done. Then: "Now switch to the Benchmark tab."

2.2.F

★ The "evaluate the eval" moment

10 min HO + discussion LOAD-BEARING — DO NOT CUT

If anything in Block 2 has to land, it's this. Slow down. Hold the room.

Slide 1 / 5 · Analyzer pass

The analyzer pass — patterns the aggregate hides

Top section of the Benchmark tab
Looks for: non-discriminating, flaky, suspicious patterns
Read this section every time you run evals
Highest-signal output the plugin produces

Slide 2 / 5 · The flag

Case 0 assertion: `contains: "code"`

This assertion is ambiguous — it's unclear what meaningful task outcome 'contains the word code' is supposed to verify. A project inspector outputting valid JSON about a Nuxt repo has no reason to include the word 'code' unless it appears in identifiers or descriptions. The assertion likely fails for correct outputs and would pass for incorrect ones that pad descriptions with generic text. Consider replacing it with a more discriminating check.

0% with-skill · 0% baseline · both failed
Same result for both configurations — assertion isn't measuring the skill

Analyzer wording from 2026-05-18 dry-run · grading.json eval_feedback

Slide 3 / 5 · Why it matters

What this teaches

"Code" is the wrong shape of test for THIS task — a structured JSON briefing about a Nuxt repo has no natural reason to use it
The assertion fails on correct outputs and would pass on outputs that pad with generic text — rewarding the wrong behavior
Both configurations got the same result; assertion measures neither
Without the analyzer pass, you would not have caught this

Slide 4 / 5 · The discipline

Evaluate the eval

"Every prompt-engineering team writes assertions like this. They look reasonable. The numbers don't change when your skill changes. You feel comfortable shipping. The skill regresses silently in production. Your evals don't catch it."

Slide 5 / 5 · Fix it

A discriminating assertion

Before:

{ "type": "contains", "value": "code" }

After:

{
  "type": "contains",
  "value": "architecture",
  "description": "Output uses the structured key 'architecture'"
}

Why: the skill produces architecture as a JSON key; the baseline usually doesn't.

Speaker notes

Slide 1 — frame the analyzer: "Benchmark tab. Top section is the analyzer pass — patterns the aggregate numbers might hide. Read this section carefully whenever you run evals; it's the highest-signal output the plugin produces."

Slide 2 — find the flag: "Here. Look at this. The analyzer flagged the first assertion on case 0 — contains: code — as non-discriminating. Both with-skill and without-skill failed it. Same result, both configurations." Read the analyzer's exact text aloud from the slide quote.

Then: "What does that tell us?" Pause for 90 seconds. Take 2–3 audience guesses. Likely: "the assertion isn't testing the skill," "it's the wrong shape of test." Affirm both.

Slide 3 — land the explanation: "It means the word 'code' is the wrong shape of test for this task. A structured JSON briefing about a Nuxt repo has no natural reason to mention 'code' — so the skill's correct output fails the assertion. And a baseline that pads with generic prose probably would mention 'code' — so a bad output would pass. The assertion is rewarding the wrong behavior. Either way, with-skill and without-skill both got the same result, so the assertion isn't measuring what your skill does."

Slide 4 — the discipline (memorized — do not read live): "This is the planted pathology. We seeded the eval set with one assertion that doesn't work. The point isn't 'gotcha.' The point is — every prompt-engineering team writes assertions like this. They look reasonable. They pass. You feel good. The skill regresses silently in production. Your evals stay green."

Group discussion: "What would a discriminating assertion look like for case 0? Take 30 seconds, talk to your neighbor." Pause. "Show of hands — who has a better assertion?" Take 2–3 suggestions.

Slide 5 — fix and close: "Good. Let's replace contains: code with contains: architecture. The skill produces a key called architecture in its JSON output; the baseline often doesn't. Now we have a discriminator." Save the file.

Verbal anchor to close: "This is the moment 'evaluate the eval' stops being a slogan. The plugin just told you one of your tests doesn't test anything. Every iteration of the loop will run this analyzer — every time. You don't just learn whether your skill works; you learn whether your tests for 'works' are testing anything. That's the discipline."

2.2.G

Improve the skill + fork point

5 min Hands-on

Slide 1 / 2 · Where to edit

Two places the loop teaches you to edit

evals.json — when an assertion didn't discriminate (we just did this)
SKILL.md body — when a baseline run revealed a gap the skill should have closed
Next iteration re-runs everything and shows the improvement

Slide 2 / 2 · Fork point

Branch off, or stay on Project Inspector

Bring your own use case? Branch now. Make a SKILL.md, an evals.json, run the loop.
Want one more rep? Stay with Project Inspector through 2.3.
Either way: reconvene at 11:40 for the description optimizer.

SKILL.md = frontmatter + body

Why our description is slightly pushy

Your turn (3 min)

Our v5 — copy if you didn't bring one you like

What every eval suite should test

Two positive, three pressure

10 subagent tasks, in parallel

Parallel ≠ "faster"

What you'll see on disk

A separate subagent reads your assertions

From grading to benchmark

With-skill vs baseline pass rates

Two tabs: Outputs and Benchmark

Case 2 — the negative trigger

Case 0 — both passed. Looks great.

The analyzer pass — patterns the aggregate hides

Case 0 assertion: contains: "code"

What this teaches

Evaluate the eval

A discriminating assertion

Two places the loop teaches you to edit

Branch off, or stay on Project Inspector

Case 0 assertion: `contains: "code"`