Name: regen-eval-baseline
Author: sumithr

System Documentation

What problem does it solve?

It captures how a single skill’s behavior changes over time, especially after edits to its SKILL.md, and lets you compare runs without losing historical evidence.

Core Features & Use Cases

Per-skill eval baselines: Runs promptfoo for one named skill and saves the judge output JSON to a local, gitignored snapshot directory.
Labelled snapshots with delta: Accepts a kebab-case label (default: baseline) and automatically reports pass/fail deltas against the most recent prior snapshot for the same skill.
Guardrails for reproducibility: Enforces valid kebab-case inputs, checks that the corresponding eval YAML exists, requires OPENAI_API_KEY to be set, and blocks clobbering existing snapshots unless forced.
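
The guardrails and the pass/fail delta described above can be sketched roughly as follows. This is an illustrative sketch, not the skill's actual code: the function names, the file paths passed in, the `force` parameter, and the `{"pass": ..., "fail": ...}` snapshot shape are all assumptions made for the example. The real judge output JSON written by promptfoo is likely structured differently.

```python
from __future__ import annotations

import re
from pathlib import Path
from typing import Mapping

# Lowercase alphanumeric words joined by single hyphens, e.g. "regen-eval-baseline".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_kebab_case(value: str) -> bool:
    """Return True if the whole string is kebab-case."""
    return KEBAB_CASE.fullmatch(value) is not None


def guardrail_errors(
    skill: str,
    label: str,
    eval_yaml: Path,          # hypothetical location of the skill's eval YAML
    snapshot: Path,           # hypothetical path the new snapshot would be written to
    env: Mapping[str, str],
    force: bool = False,      # hypothetical override for the clobber check
) -> list[str]:
    """Collect every guardrail violation instead of stopping at the first."""
    errors = []
    for name, value in (("skill", skill), ("label", label)):
        if not is_kebab_case(value):
            errors.append(f"{name} must be kebab-case: {value!r}")
    if not eval_yaml.is_file():
        errors.append(f"eval YAML not found: {eval_yaml}")
    if not env.get("OPENAI_API_KEY"):
        errors.append("OPENAI_API_KEY is not set")
    if snapshot.exists() and not force:
        errors.append(f"snapshot already exists: {snapshot}")
    return errors


def pass_fail_delta(previous: Mapping[str, int], current: Mapping[str, int]) -> dict[str, int]:
    """Change in pass and fail counts from the prior snapshot to the current one."""
    return {key: current.get(key, 0) - previous.get(key, 0) for key in ("pass", "fail")}


print(pass_fail_delta({"pass": 8, "fail": 2}, {"pass": 9, "fail": 1}))
```

Returning a list of errors rather than raising on the first one lets a caller report every problem in a single run, which is a design choice of this sketch and not something the listing confirms.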

Quick Start

Run the baseline script with the skill name you want to measure and an optional label:

    python3 .claude/skills/regen-eval-baseline/scripts/run_baseline.py --skill <skill-name> --label <label>
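
For readers curious how a command line like this is typically parsed, here is a minimal argparse sketch. Only `--skill`, `--label`, and the `baseline` default come from the description above. The `--force` flag name, the help strings, and the `build_parser` function are assumptions for illustration, and the real run_baseline.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="run_baseline.py",
        description="Run the eval for one skill and save a labelled snapshot.",
    )
    parser.add_argument("--skill", required=True, help="kebab-case name of the skill to measure")
    parser.add_argument("--label", default="baseline", help="kebab-case snapshot label")
    # Hypothetical flag name: the listing only says existing snapshots
    # are protected "unless forced".
    parser.add_argument("--force", action="store_true", help="overwrite an existing snapshot")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--skill", "my-skill"])
print(args.skill, args.label, args.force)  # prints: my-skill baseline False
```

Omitting `--label` falls back to `baseline`, which matches the default mentioned in the feature list.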

Installation: Download https://github.com/sumithr/sumo-qa/archive/main.zip#regen-eval-baseline, extract the .zip file, and install the regen-eval-baseline skill in the .claude/skills/ directory.
