regen-eval-baseline
Community
Snapshot a skill’s eval results, safely.
Category: Software Engineering
Tags: quality assurance, TDD, test planning, promptfoo, snapshot testing, delta analysis, eval baseline
Author: sumithr
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It captures and compares how a single skill’s behavior changes over time, especially after edits to its SKILL.md, without losing historical evidence.
Core Features & Use Cases
- Per-skill eval baselines: Runs promptfoo for one named skill and saves the judge output JSON to a local, gitignored snapshot directory.
- Labelled snapshots with delta: Accepts a kebab-case label (default: baseline) and automatically reports the pass/fail delta against the most recent prior snapshot for the same skill.
- Guardrails for reproducibility: Enforces valid kebab-case inputs, checks that the corresponding eval YAML exists, requires OPENAI_API_KEY to be set, and blocks clobbering existing snapshots unless forced.
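The guardrails and the delta report listed above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of run_baseline.py: the function names, the `force` parameter, and the `{"passed": ..., "failed": ...}` summary shape are all hypothetical, chosen only to make the checks concrete.

```python
import re
from pathlib import Path

# Lowercase alphanumeric words joined by single hyphens, e.g. "regen-eval-baseline".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(value: str) -> bool:
    """Return True when the whole string is kebab-case."""
    return KEBAB_CASE.fullmatch(value) is not None


def check_guardrails(skill, label, eval_yaml, snapshot, env, force=False):
    """Return a list of problems; an empty list means the run may proceed.

    All parameter names are assumptions made for this sketch.
    """
    problems = []
    for name, value in (("skill", skill), ("label", label)):
        if not is_kebab_case(value):
            problems.append(f"{name} {value!r} is not kebab-case")
    if not Path(eval_yaml).is_file():
        problems.append(f"eval YAML not found: {eval_yaml}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Refuse to overwrite an existing snapshot unless explicitly forced.
    if Path(snapshot).exists() and not force:
        problems.append(f"snapshot already exists: {snapshot}")
    return problems


def pass_fail_delta(previous: dict, current: dict) -> dict:
    """Subtract the previous pass/fail counts from the current ones."""
    return {key: current[key] - previous[key] for key in ("passed", "failed")}
```

Collecting every problem into a list, rather than stopping at the first one, lets a single run report all misconfigurations at once. In practice `env` would typically be `os.environ`.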
Quick Start
Run the baseline script with the name of the skill to measure and, optionally, a label: python3 .claude/skills/regen-eval-baseline/scripts/run_baseline.py --skill <skill-name> --label <label>.
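To make the command-line shape concrete, here is one way the two documented flags could be parsed. Only `--skill`, `--label`, and the default label `baseline` come from the description above; the `build_parser` helper, the help strings, and the example skill name `my-skill` are assumptions for illustration, and the real script may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags named in the Quick Start (a sketch)."""
    parser = argparse.ArgumentParser(prog="run_baseline.py")
    parser.add_argument("--skill", required=True, help="kebab-case skill name")
    parser.add_argument(
        "--label", default="baseline", help="kebab-case snapshot label"
    )
    return parser


# Omitting --label falls back to the documented default.
args = build_parser().parse_args(["--skill", "my-skill"])
print(args.skill, args.label)  # prints "my-skill baseline"
```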
Dependency Matrix
Required Modules
None required
Components
scripts
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: regen-eval-baseline Download link: https://github.com/sumithr/sumo-qa/archive/main.zip#regen-eval-baseline Please download this .zip file, extract it, and install it in the .claude/skills/ directory.