skill-eval-loop
Run the observe-analyze-iterate loop on promptfoo evals for a skill collection. Promptfoo-specific — assumes promptfoo is installed, tests live in YAML, and results are in the standard SQLite DB at ~/.promptfoo/promptfoo.db. Use when the user has a promptfoo eval suite and wants to diagnose failures, fix them, and re-run targeted tests. Triggers on phrases like "eval failed", "analyze the promptfoo eval", "iterate on skills", or any request to improve skills based on promptfoo output.
Run the observe → analyze → iterate loop on a promptfoo eval suite for a skill collection. The goal: turn ambiguous failures into categorized, actionable fixes, then validate the fix with a targeted re-run.
Scope: Promptfoo Only
This skill is tightly scoped to promptfoo. It
uses promptfoo’s --filter-pattern flag, its SQLite schema
(eval_results, evals tables, test_case/response/grading_result JSON
columns), and its YAML test format. It will not work with other eval
harnesses (DeepEval, Vitest-based evals, bespoke Python runners, etc.).
The reasoning content — the three failure categories (skill weakness / false
positive / test design issue) and the fix recipes — is portable to any eval
framework. But every command in this skill assumes promptfoo. If the user is
on a different harness, read references/categorization.md and
references/fix-patterns.md for the portable reasoning, then adapt the
commands by hand.
Prerequisites — Check Before Proceeding
Run these checks first. If any fail, stop and tell the user what’s missing rather than trying to work around it.
# 1. promptfoo CLI available?
command -v promptfoo >/dev/null || {
  echo "ERROR: promptfoo not on PATH. Install: npm install -g promptfoo"
  exit 1
}
# 2. promptfoo results DB exists?
test -f "${PROMPTFOO_DB:-$HOME/.promptfoo/promptfoo.db}" || {
  echo "ERROR: no promptfoo DB found. Run an eval first."
  exit 1
}
# 3. An inference endpoint is reachable?
# Ollama: curl -s http://localhost:11434/api/tags
# Or: check that the relevant API key env var is set.
Also confirm:
- A promptfoo eval config (promptfooconfig.yaml or similar) exists in the project
- The skills being evaluated are in a directory you can edit
Ask the user for the eval ID to analyze, or query for the most recent:
sqlite3 ~/.promptfoo/promptfoo.db \
"SELECT id, created_at FROM evals ORDER BY created_at DESC LIMIT 5;"
Step 1: Observe
Query the promptfoo DB for results of the eval in question:
sqlite3 ~/.promptfoo/promptfoo.db "
  SELECT
    json_extract(test_case, '\$.description') AS test,
    success,
    substr(json_extract(response, '\$.output'), 1, 800) AS output_preview,
    json_extract(grading_result, '\$.reason') AS reason
  FROM eval_results
  WHERE eval_id = '<EVAL_ID>'
  ORDER BY success ASC, test;
"
Read the full output the model produced, not just the grader’s reason. The output shows what the model actually did; the reason tells you what the assertion thought was wrong. Often they disagree — that’s the interesting case.
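The query in this step previews only the first 800 characters of each output, which can hide the part of the response that matters. One way to read the untruncated output is a small Python helper like the sketch below. It assumes the eval_results columns named in this skill (eval_id, test_case, response, grading_result, success) and that success is stored as 0/1. The function name failing_results is illustrative and not part of promptfoo.

```python
import json
import sqlite3


def _load(column):
    """Decode a JSON column; treat NULL or JSON null as an empty object."""
    return (json.loads(column) if column else None) or {}


def failing_results(conn, eval_id):
    """Return one dict per failed row: description, full output, grader reason."""
    rows = conn.execute(
        "SELECT test_case, response, grading_result "
        "FROM eval_results WHERE eval_id = ? AND success = 0",
        (eval_id,),
    ).fetchall()

    failures = []
    for test_case, response, grading_result in rows:
        failures.append({
            "test": _load(test_case).get("description"),
            "output": _load(response).get("output"),  # full text, not a preview
            "reason": _load(grading_result).get("reason"),
        })
    # Sort by description so repeated runs list failures in a stable order.
    return sorted(failures, key=lambda f: f["test"] or "")
```

To use it against the real database, open a connection with sqlite3.connect on the expanded path ~/.promptfoo/promptfoo.db and pass the eval ID. Reading output and reason side by side makes disagreements between the model and the grader easier to spot.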
Step 2: Categorize Each Failure
Every failure falls into one of three buckets. Each has a different fix location.
Skill weakness
The model produced output that genuinely violates the skill’s rules (e.g., recommends a banned font in the actual CSS it emits, skips a required step, asks for confirmation instead of producing the deliverable).
Fix: Edit the SKILL.md.
- Inline the positive default the model should reach for
- Tighten ban language (“never name these fonts anywhere” beats “don’t use Inter”)
- Require the deliverable in the first response, not in a later round
- Remove bare /other-skill commands from workflow steps
False positive
The model did the right thing but the assertion penalized it. Common causes:
- Substring match on a banned term that also appears in legitimate words (“Inter” in “interfaces”)
- Anywhere-in-output check on a banned pattern that was correctly being criticized in prose
- JS assertion throwing because promptfoo passed output as an object, not a string
- LLM-rubric grader used a default provider without an API key
Fix: Edit the assertion.
- Word-boundary regex instead of substring (\bInter\b)
- CSS-declaration scoping instead of anywhere match
- Output-object unwrapping at the top of every JS assertion
- Output-object unwrapping at the top of every JS assertion
- Point grader at the same local model
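The first three assertion fixes can be illustrated with a short Python sketch. It is only an illustration of the matching logic, not promptfoo's assertion API: the helper names are hypothetical, the banned term "Inter" is taken from the example in this section, and the assumption that a non-string output is a dict carrying its text under an "output" key is a guess that should be checked against what the grader actually receives.

```python
import re

BANNED_FONT = "Inter"  # example banned term from this section

# Matches the banned name only as a whole word.
WORD_BOUNDARY = re.compile(r"\bInter\b", re.IGNORECASE)

# Matches the banned name only inside a font-family declaration.
CSS_DECLARATION = re.compile(r"font-family\s*:[^;{}]*\bInter\b", re.IGNORECASE)


def as_text(output):
    """Coerce the value under test to a string before matching.

    Assumption: a non-string output is a dict with the text under "output".
    """
    if isinstance(output, str):
        return output
    if isinstance(output, dict) and isinstance(output.get("output"), str):
        return output["output"]
    return str(output)


def naive_substring_hit(output):
    """Case-insensitive substring check; flags words such as "interfaces"."""
    return BANNED_FONT.lower() in as_text(output).lower()


def word_boundary_hit(output):
    """Whole-word check; ignores "interfaces" but still flags prose mentions."""
    return bool(WORD_BOUNDARY.search(as_text(output)))


def css_declaration_hit(output):
    """Flags the banned font only where it is used in a font-family value."""
    return bool(CSS_DECLARATION.search(as_text(output)))
```

With these helpers, "Design clean interfaces" trips only the naive check, and a response that criticizes Inter in prose while emitting a different font-family trips the word-boundary check but not the CSS-scoped one. Which scope is right depends on whether the skill bans naming the font at all or only using it.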
Test design issue
The test asks for contradictory behavior. Example: “Do a UX audit” with no code → skill correctly refuses → assertion fails because there’s no audit to grade. Or: test provides partial context, skill correctly gates on missing context, assertion expects the skill to proceed anyway.
Fix: Edit the test.
- Provide the inputs the skill needs inline in the message
- Split a multi-assertion test into separate focused tests
- Test the gating behavior explicitly (“correctly asks for X”) as its own case
See references/categorization.md for detailed diagnostic questions for each
bucket.
Step 3: Iterate
Apply the appropriate fix. Follow these principles:
- Don’t fight the model. If the model reaches for a banned pattern, give it a better default inline.
- Use positive framing. Show what TO use, not just what to avoid.
- Don’t ask the model to infer what you can pre-compute. Lookup tables beat rules.
- Critical content lives in SKILL.md, not references/. References get skipped.
- Bump the skill version on every skill change (e.g., 1.0.0 → 1.1.0).
See references/fix-patterns.md for specific edit recipes for each failure category.
Step 4: Re-run Targeted Eval
Only re-run the failing tests, not the full suite. Construct a filter pattern from the test descriptions:
promptfoo eval -c <config>.yaml --no-cache \
--filter-pattern "Test A description|Test B description"
Wait for completion. Note the new eval ID from the output.
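Because the filter pattern joins descriptions with |, it appears to be interpreted as a regular expression, so a description containing characters such as parentheses or periods may match more or less than intended. A minimal Python sketch for building the pattern from literal descriptions is shown below. The helper names are hypothetical; the flags are the ones used in the command above.

```python
import re

# Regex metacharacters that get a backslash so they match literally.
_REGEX_META = re.compile(r"([.^$*+?()\[\]{}|\\])")


def build_filter_pattern(descriptions):
    """Join test descriptions into one alternation of literal strings."""
    return "|".join(_REGEX_META.sub(r"\\\1", d) for d in descriptions)


def rerun_command(config_path, descriptions):
    """Return the argv list for a targeted re-run (hypothetical helper)."""
    return [
        "promptfoo", "eval",
        "-c", config_path,
        "--no-cache",
        "--filter-pattern", build_filter_pattern(descriptions),
    ]
```

Passing the returned list to subprocess.run avoids a second layer of shell quoting around the pattern.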
Step 5: Validate
Query the new eval ID the same way as Step 1. For each previously failing test:
- Passed: Fix worked. Commit the changes.
- Still failing: Look at the output again. Was the diagnosis wrong? Was the fix incomplete?
If still failing, iterate once more — but no more than two rounds total. If a third round is needed, the problem is deeper than surface fixes and requires a structural rethink of the skill.
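The validation step amounts to comparing the original eval with the targeted re-run, test by test. A rough Python sketch of that comparison follows, assuming the same eval_results columns used in Step 1 and that success is stored as 0/1. The function names and the three report buckets are illustrative choices, not promptfoo features.

```python
import json
import sqlite3


def outcomes(conn, eval_id):
    """Map each test description to True if every row for it passed."""
    rows = conn.execute(
        "SELECT test_case, success FROM eval_results WHERE eval_id = ?",
        (eval_id,),
    ).fetchall()
    result = {}
    for test_case, success in rows:
        case = (json.loads(test_case) if test_case else None) or {}
        desc = case.get("description") or "<no description>"
        # If a description has several rows, one failing row fails the test.
        result[desc] = result.get(desc, True) and bool(success)
    return result


def compare_runs(conn, old_eval_id, new_eval_id):
    """Classify every test that failed in the old eval by its new outcome."""
    old = outcomes(conn, old_eval_id)
    new = outcomes(conn, new_eval_id)
    report = {"fixed": [], "still_failing": [], "not_rerun": []}
    for desc, passed in sorted(old.items()):
        if passed:
            continue  # only previously failing tests are of interest
        if desc not in new:
            report["not_rerun"].append(desc)
        elif new[desc]:
            report["fixed"].append(desc)
        else:
            report["still_failing"].append(desc)
    return report
```

A non-empty not_rerun list usually means the filter pattern missed a test, which is worth checking before concluding that a fix worked.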
Step 6: Report and Commit
Commit with a message that explains the failure mode and the fix, not just what changed:
Fix design-typeset eval failures: require CSS in Round 1 output
Previously the skill's "propose then wait for confirmation" workflow
meant single-turn evals never saw any CSS. Changed Round 1 to emit
working CSS on every response, with follow-up questions after the
deliverable, not before.
Report to the user:
- What you observed (tests with failure categories)
- What you changed (files + rationale)
- Re-run results (pass/fail count)
- Whether another iteration is warranted
Anti-patterns
- Don’t re-run the full suite when only a few tests failed — wastes time and tokens
- Don’t blindly trust the grader’s reason — read the full output to understand what actually happened
- Don’t “fix” a skill that’s doing the right thing — if the assertion is wrong, fix the assertion
- Don’t skip categorization — applying a skill-weakness fix to a false-positive failure makes the skill worse
- Don’t iterate more than twice — three iterations means the approach is wrong, not the details
References
- references/categorization.md: diagnostic questions to classify each failure
- references/fix-patterns.md: edit recipes for each category (skill, assertion, test)
- references/local-eval-setup.md: Ollama provider config, LLM-as-judge setup, concurrency tuning
- scripts/pull_failures.sh: convenience script to dump failures from the eval DB