manage-evals
Community
Streamline evaluation management, comparison, and reporting.
Author: SLEEPYBQ
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill simplifies the process of triggering, comparing, and reporting on AI evaluation runs across different benchmarks and models.
Core Features & Use Cases
- Trigger Evaluations: Initiate new evaluation runs on various benchmarks with customizable configurations via GitHub workflows.
- Compare Results: Fetch and analyze two evaluation runs to identify performance differences and regressions at the instance level.
- Reporting and Commenting: Generate detailed markdown reports highlighting success rates, regressions, and improvements, with options to post comments directly on GitHub pull requests.
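The comparison and reporting features above can be illustrated with a minimal sketch. It is not the Skill's actual implementation: the result format (a mapping from instance ID to a pass/fail boolean) and the function names `compare_runs` and `render_report` are assumptions made for illustration only.

```python
# Hypothetical sketch: the data shape and function names are assumptions,
# not taken from the manage-evals scripts.

def compare_runs(baseline, candidate):
    """Compare two runs given as {instance_id: passed} dictionaries."""
    # Only instances present in both runs can be compared fairly.
    shared = sorted(set(baseline) & set(candidate))
    regressions = [i for i in shared if baseline[i] and not candidate[i]]
    improvements = [i for i in shared if not baseline[i] and candidate[i]]

    def rate(run):
        return sum(bool(run[i]) for i in shared) / len(shared) if shared else 0.0

    return {
        "shared": len(shared),
        "baseline_rate": rate(baseline),
        "candidate_rate": rate(candidate),
        "regressions": regressions,
        "improvements": improvements,
    }


def render_report(summary):
    """Render the comparison summary as a markdown report string."""
    lines = [
        "## Evaluation comparison",
        "",
        f"- Instances compared: {summary['shared']}",
        f"- Baseline success rate: {summary['baseline_rate']:.1%}",
        f"- Candidate success rate: {summary['candidate_rate']:.1%}",
        "",
        "### Regressions",
    ]
    lines += [f"- {i}" for i in summary["regressions"]] or ["- none"]
    lines += ["", "### Improvements"]
    lines += [f"- {i}" for i in summary["improvements"]] or ["- none"]
    return "\n".join(lines)


if __name__ == "__main__":
    baseline = {"a": True, "b": True, "c": False, "d": False}
    candidate = {"a": True, "b": False, "c": True, "e": True}
    print(render_report(compare_runs(baseline, candidate)))
```

Restricting the comparison to instances present in both runs keeps the success rates comparable when the two runs cover slightly different instance sets.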
Quick Start
Trigger a new evaluation on a specific branch or tag, then compare the latest run with a previous baseline for performance insights.
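As a rough illustration of the trigger step, the sketch below builds a request for GitHub's REST endpoint that dispatches a workflow, using only `urllib` and `json`. The workflow file name `run-eval.yml`, the repository coordinates, and the input names `benchmark` and `model` are hypothetical; the Skill's real workflow and inputs may differ.

```python
import json
import urllib.request

# Hypothetical sketch: the workflow file name and input keys are assumptions.

def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request for GitHub Actions."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # "ref" is the branch or tag to run on; "inputs" are the workflow inputs.
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_dispatch_request(
        owner="example-org",          # placeholder owner
        repo="example-repo",          # placeholder repository
        workflow_file="run-eval.yml", # assumed workflow file name
        ref="main",
        inputs={"benchmark": "example-benchmark", "model": "example-model"},
        token="<your token>",
    )
    print(req.get_method(), req.full_url)
    # To actually trigger the run: urllib.request.urlopen(req)
```

Separating request construction from sending makes the payload easy to inspect before anything runs. GitHub typically answers a successful dispatch with an empty 204 response, so the new run's ID has to be looked up separately before comparing it with a baseline.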
Dependency Matrix
Required Modules
urllib, json
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: manage-evals Download link: https://github.com/SLEEPYBQ/adaptive-rehearsal/archive/main.zip#manage-evals Please download this .zip file, extract it, and install it in the .claude/skills/ directory.