skill-eval-loop — Stuffbucket Skills

← Catalog

Run the observe-analyze-iterate loop on promptfoo evals for a skill collection. Promptfoo-specific — assumes promptfoo is installed, tests live in YAML, and results are in the standard SQLite DB at ~/.promptfoo/promptfoo.db. Use when the user has a promptfoo eval suite and wants to diagnose failures, fix them, and re-run targeted tests. Triggers on phrases like "eval failed", "analyze the promptfoo eval", "iterate on skills", or any request to improve skills based on promptfoo output.

1 script references Source

Run the observe → analyze → iterate loop on a promptfoo eval suite for a skill collection. The goal: turn ambiguous failures into categorized, actionable fixes, then validate the fix with a targeted re-run.

Scope: Promptfoo Only

This skill is tightly scoped to promptfoo. It uses promptfoo’s --filter-pattern flag, its SQLite schema (eval_results, evals tables, test_case/response/grading_result JSON columns), and its YAML test format. It will not work with other eval harnesses (DeepEval, Vitest-based evals, bespoke Python runners, etc.).

The reasoning content — the three failure categories (skill weakness / false positive / test design issue) and the fix recipes — is portable to any eval framework. But every command in this skill assumes promptfoo. If the user is on a different harness, read references/categorization.md and references/fix-patterns.md for the portable reasoning, then adapt the commands by hand.

Prerequisites — Check Before Proceeding

Run these checks first. If any fail, stop and tell the user what’s missing rather than trying to work around it.

# 1. promptfoo CLI available?
command -v promptfoo >/dev/null || {
  echo "ERROR: promptfoo not on PATH. Install: npm install -g promptfoo"
  exit 1
}

# 2. promptfoo results DB exists?
test -f "${PROMPTFOO_DB:-$HOME/.promptfoo/promptfoo.db}" || {
  echo "ERROR: no promptfoo DB found. Run an eval first."
  exit 1
}

# 3. An inference endpoint is reachable?
#    Ollama: curl -s http://localhost:11434/api/tags
#    Or: check that the relevant API key env var is set.

Also confirm:

A promptfoo eval config (promptfooconfig.yaml or similar) exists in the project
The skills being evaluated are in a directory you can edit

Ask the user for the eval ID to analyze, or query for the most recent:

sqlite3 ~/.promptfoo/promptfoo.db \
  "SELECT id, created_at FROM evals ORDER BY created_at DESC LIMIT 5;"

Step 1: Observe

Query the promptfoo DB for results of the eval in question:

sqlite3 ~/.promptfoo/promptfoo.db "
SELECT
  json_extract(test_case, '\$.description') as test,
  success,
  substr(json_extract(response, '\$.output'), 1, 800) as output_preview,
  json_extract(grading_result, '\$.reason') as reason
FROM eval_results
WHERE eval_id = '<EVAL_ID>'
ORDER BY success ASC, test;
"

Read the full output the model produced, not just the grader’s reason. The output shows what the model actually did; the reason tells you what the assertion thought was wrong. Often they disagree — that’s the interesting case.

Step 2: Categorize Each Failure

Every failure falls into one of three buckets. Each has a different fix location.

Skill weakness

The model produced output that genuinely violates the skill’s rules (e.g., recommends a banned font in the actual CSS it emits, skips a required step, asks for confirmation instead of producing the deliverable).

Fix: Edit the SKILL.md.

Inline the positive default the model should reach for
Tighten ban language (“never name these fonts anywhere” beats “don’t use Inter”)
Require the deliverable in the first response, not in a later round
Remove bare /other-skill commands from workflow steps

False positive

The model did the right thing but the assertion penalized it. Common causes:

Substring match on a banned term that also appears in legitimate words (“Inter” in “interfaces”)
Anywhere-in-output check on a banned pattern that was correctly being criticized in prose
JS assertion throwing because promptfoo passed output as an object, not a string
LLM-rubric grader used a default provider without an API key

Fix: Edit the assertion.

Word-boundary regex instead of substring (\bInter\b)
CSS-declaration scoping instead of anywhere match
Output-object unwrapping at the top of every JS assertion
Point grader at the same local model

Test design issue

The test asks for contradictory behavior. Example: “Do a UX audit” with no code → skill correctly refuses → assertion fails because there’s no audit to grade. Or: test provides partial context, skill correctly gates on missing context, assertion expects the skill to proceed anyway.

Fix: Edit the test.

Provide the inputs the skill needs inline in the message
Split a multi-assertion test into separate focused tests
Test the gating behavior explicitly (“correctly asks for X”) as its own case

See references/categorization.md for detailed diagnostic questions for each bucket.

Step 3: Iterate

Apply the appropriate fix. Follow these principles:

Don’t fight the model. If the model reaches for a banned pattern, give it a better default inline.
Use positive framing. Show what TO use, not just what to avoid.
Don’t ask the model to infer what you can pre-compute. Lookup tables beat rules.
Critical content lives in SKILL.md, not references/. References get skipped.
Bump the skill version on every skill change (e.g., 1.0.0 → 1.1.0).

See references/fix-patterns.md for specific edit recipes for each failure category.

Step 4: Re-run Targeted Eval

Only re-run the failing tests, not the full suite. Construct a filter pattern from the test descriptions:

promptfoo eval -c <config>.yaml --no-cache \
  --filter-pattern "Test A description|Test B description"

Wait for completion. Note the new eval ID from the output.

Step 5: Validate

Query the new eval ID the same way as Step 1. For each previously failing test:

Passed: Fix worked. Commit the changes.
Still failing: Look at the output again. Was the diagnosis wrong? Was the fix incomplete?

If still failing, iterate once more — but no more than two rounds total. If a third round is needed, the problem is deeper than surface fixes and requires a structural rethink of the skill.

Step 6: Report and Commit

Commit with a message that explains the failure mode and the fix, not just what changed:

Fix design-typeset eval failures: require CSS in Round 1 output

Previously the skill's "propose then wait for confirmation" workflow
meant single-turn evals never saw any CSS. Changed Round 1 to emit
working CSS on every response, with follow-up questions after the
deliverable, not before.

Report to the user:

What you observed (tests with failure categories)
What you changed (files + rationale)
Re-run results (pass/fail count)
Whether another iteration is warranted

Anti-patterns

Don’t re-run the full suite when only a few tests failed — wastes time and tokens
Don’t blindly trust the grader’s reason — read the full output to understand what actually happened
Don’t “fix” a skill that’s doing the right thing — if the assertion is wrong, fix the assertion
Don’t skip categorization — applying a skill-weakness fix to a false-positive failure makes the skill worse
Don’t iterate more than twice — three iterations means the approach is wrong, not the details

References

references/categorization.md — diagnostic questions to classify each failure
references/fix-patterns.md — edit recipes for each category (skill, assertion, test)
references/local-eval-setup.md — Ollama provider config, LLM-as-judge setup, concurrency tuning
scripts/pull_failures.sh — convenience script to dump failures from the eval DB