regen-eval-baseline

Community

Snapshot a skill’s eval results, safely.

Author: sumithr
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It captures how a single skill’s behavior changes over time, especially after edits to its SKILL.md, and lets you compare runs without losing historical evidence.

Core Features & Use Cases

  • Per-skill eval baselines: Runs promptfoo for one named skill and saves the judge output JSON to a local, gitignored snapshot directory.
  • Labelled snapshots with delta: Accepts a kebab-case label (default: baseline) and automatically reports pass/fail deltas against the most recent prior snapshot for the same skill.
  • Guardrails for reproducibility: Enforces valid kebab-case inputs, checks that the corresponding eval YAML exists, requires OPENAI_API_KEY to be set, and blocks clobbering existing snapshots unless forced.
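The guardrails and the delta report described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names, the error messages, and the `{"pass": n, "fail": n}` count shape are all assumptions made for the example, and the real script may structure its snapshots differently.

```python
import os
import re
from pathlib import Path
from typing import Dict, List, Mapping, Optional

# Lowercase alphanumeric words joined by single hyphens, e.g. "my-skill".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(value: str) -> bool:
    """Return True if the whole string is kebab-case."""
    return KEBAB_CASE.fullmatch(value) is not None


def check_inputs(
    skill: str,
    label: str,
    eval_yaml: Path,
    snapshot: Path,
    force: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Collect guardrail violations (hypothetical helper; empty list means OK)."""
    env = os.environ if env is None else env
    problems = []
    if not is_kebab_case(skill):
        problems.append(f"skill name is not kebab-case: {skill!r}")
    if not is_kebab_case(label):
        problems.append(f"label is not kebab-case: {label!r}")
    if not eval_yaml.is_file():
        problems.append(f"eval YAML not found: {eval_yaml}")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    if snapshot.exists() and not force:
        problems.append(f"snapshot already exists (not forced): {snapshot}")
    return problems


def pass_fail_delta(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Difference in pass/fail counts between two snapshots (assumed count shape)."""
    return {key: current.get(key, 0) - previous.get(key, 0) for key in ("pass", "fail")}
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single run, which is one reasonable design choice among several.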

Quick Start

Run the baseline script with the name of the skill you want to measure and an optional label:

    python3 .claude/skills/regen-eval-baseline/scripts/run_baseline.py --skill <skill-name> --label <label>
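If you prefer to trigger the run from another Python script, the same command line could be assembled along these lines. The helper name `baseline_command` and the example skill name and label are hypothetical; only the script path and the `--skill` and `--label` flags come from the Quick Start above.

```python
from typing import List, Optional

# Script path as given in the Quick Start, relative to the repository root.
SCRIPT = ".claude/skills/regen-eval-baseline/scripts/run_baseline.py"


def baseline_command(skill: str, label: Optional[str] = None) -> List[str]:
    """Build the argv list for one baseline run (hypothetical helper)."""
    argv = ["python3", SCRIPT, "--skill", skill]
    if label is not None:
        # Omitting --label leaves the script to apply its own default.
        argv += ["--label", label]
    return argv
```

Passing the result to `subprocess.run(baseline_command("my-skill", "after-edit"), check=True)` would execute the run and raise if the script exits with a non-zero status.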

Dependency Matrix

Required Modules

None required

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: regen-eval-baseline
Download link: https://github.com/sumithr/sumo-qa/archive/main.zip#regen-eval-baseline

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
