Name: manage-evals
Author: SLEEPYBQ

System Documentation

What problem does it solve?

This Skill simplifies the process of triggering, comparing, and reporting on AI evaluation runs across different benchmarks and models.

Core Features & Use Cases

Trigger Evaluations: Initiate new evaluation runs on various benchmarks with customizable configurations via GitHub workflows.
Compare Results: Fetch and analyze two evaluation runs to identify performance differences and regressions at the instance level.
Reporting and Commenting: Generate detailed markdown reports highlighting success rates, regressions, and improvements, with options to post comments directly on GitHub pull requests.
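
The comparison and reporting features above can be illustrated with a minimal sketch. It assumes each run is available as a mapping from instance ID to a pass/fail flag. That data shape and the names `compare_runs`, `to_markdown`, and `Comparison` are hypothetical, chosen for illustration; they are not taken from the Skill's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Comparison:
    baseline_rate: float      # fraction of baseline instances that passed
    candidate_rate: float     # fraction of candidate instances that passed
    regressions: list[str]    # passed in baseline, failed in candidate
    improvements: list[str]   # failed in baseline, passed in candidate


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> Comparison:
    """Compare two runs keyed by instance ID (True means the instance passed).

    Assumed input shape; the real run artifacts may look different.
    """
    # Only instances present in both runs can regress or improve.
    shared = sorted(baseline.keys() & candidate.keys())
    regressions = [i for i in shared if baseline[i] and not candidate[i]]
    improvements = [i for i in shared if not baseline[i] and candidate[i]]

    def rate(run: dict[str, bool]) -> float:
        return sum(run.values()) / len(run) if run else 0.0

    return Comparison(rate(baseline), rate(candidate), regressions, improvements)


def to_markdown(c: Comparison) -> str:
    """Render the comparison as a markdown report suitable for a PR comment."""
    lines = [
        "## Evaluation comparison",
        "",
        "| Run | Success rate |",
        "| --- | --- |",
        f"| Baseline | {c.baseline_rate:.1%} |",
        f"| Candidate | {c.candidate_rate:.1%} |",
        "",
        f"### Regressions ({len(c.regressions)})",
        *[f"- {i}" for i in c.regressions],
        "",
        f"### Improvements ({len(c.improvements)})",
        *[f"- {i}" for i in c.improvements],
    ]
    return "\n".join(lines)
```

Working at the instance level matters because two runs can share the same overall success rate while passing different instances; an aggregate-only comparison would hide those regressions.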

Quick Start

Trigger a new evaluation on a specific branch or tag, then compare the latest run against a previous baseline to see which instances regressed or improved.
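
As a rough sketch of what triggering a run through a GitHub workflow can look like, the snippet below builds a request for GitHub's REST endpoint that creates a `workflow_dispatch` event. The repository, workflow file name, and input keys are placeholders assumed for illustration; the Skill's real workflow and its inputs may differ.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request for GitHub's REST API.

    `ref` is the branch or tag to evaluate; `inputs` holds the workflow's
    string inputs. All concrete values passed in are up to the caller.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical values: none of these names come from the Skill itself.
request = build_dispatch_request(
    owner="example-org",
    repo="example-repo",
    workflow_file="run-eval.yml",
    ref="main",
    inputs={"benchmark": "example-benchmark"},
    token="YOUR_TOKEN",
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch has no side effects.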

Please help me install this Skill:
Name: manage-evals
Download link: https://github.com/SLEEPYBQ/adaptive-rehearsal/archive/main.zip#manage-evals
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
