manage-evals

Community

Streamline evaluation management, comparison, and reporting.

Author: SLEEPYBQ
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill simplifies the process of triggering, comparing, and reporting on AI evaluation runs across different benchmarks and models.

Core Features & Use Cases

  • Trigger Evaluations: Initiate new evaluation runs on various benchmarks with customizable configurations via GitHub workflows.
  • Compare Results: Fetch and analyze two evaluation runs to identify performance differences and regressions at the instance level.
  • Reporting and Commenting: Generate detailed markdown reports highlighting success rates, regressions, and improvements, with options to post comments directly on GitHub pull requests.
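The comparison and reporting features above can be illustrated with a minimal sketch. It is not the Skill's actual implementation: the run format (a mapping from instance ID to a pass/fail boolean) and the function names `success_rate`, `compare_runs`, and `render_report` are assumptions made for illustration.

```python
def success_rate(run):
    """Fraction of instances marked successful; 0.0 for an empty run."""
    if not run:
        return 0.0
    return sum(1 for ok in run.values() if ok) / len(run)


def compare_runs(baseline, candidate):
    """Compare two runs instance by instance.

    Both arguments are hypothetical {instance_id: bool} mappings.
    Only instances present in both runs are checked for changes.
    """
    shared = sorted(set(baseline) & set(candidate))
    return {
        "baseline_rate": success_rate(baseline),
        "candidate_rate": success_rate(candidate),
        # Passed in the baseline, fails in the candidate.
        "regressions": [i for i in shared if baseline[i] and not candidate[i]],
        # Failed in the baseline, passes in the candidate.
        "improvements": [i for i in shared if not baseline[i] and candidate[i]],
    }


def render_report(diff):
    """Render the comparison as a short markdown report."""
    def fmt(ids):
        return ", ".join(ids) if ids else "none"

    return "\n".join([
        "## Evaluation comparison",
        "",
        f"- Baseline success rate: {diff['baseline_rate']:.1%}",
        f"- Candidate success rate: {diff['candidate_rate']:.1%}",
        f"- Regressions ({len(diff['regressions'])}): {fmt(diff['regressions'])}",
        f"- Improvements ({len(diff['improvements'])}): {fmt(diff['improvements'])}",
    ])
```

The resulting markdown string could then be used as the body of a pull request comment. Restricting the diff to instances present in both runs keeps a changed benchmark subset from being reported as a regression.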

Quick Start

Trigger a new evaluation on a specific branch or tag, then compare the resulting run against a previous baseline to see which instances regressed or improved.
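As a rough sketch of the trigger step, the snippet below builds a request for GitHub's REST endpoint that creates a `workflow_dispatch` event, using only `urllib` and `json`. The helper name `build_dispatch_request`, the workflow file name, the repository, and the `benchmark` input are all hypothetical placeholders. The Skill's real workflow and inputs are not documented here.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request.

    `workflow_file` and the keys in `inputs` are assumptions; they must
    match whatever the target workflow actually declares.
    """
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # `ref` is the branch or tag to run the workflow on.
    payload = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical usage: placeholder repository, workflow, and input names.
request = build_dispatch_request(
    owner="example-org",
    repo="example-repo",
    workflow_file="run-eval.yml",
    ref="main",
    inputs={"benchmark": "demo-benchmark"},
    token="YOUR_TOKEN",
)
```

Passing the request to `urllib.request.urlopen(request)` would send it. Building the request separately makes it easy to inspect the URL and payload before anything is triggered.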

Dependency Matrix

Required Modules

urllib, json

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: manage-evals
Download link: https://github.com/SLEEPYBQ/adaptive-rehearsal/archive/main.zip#manage-evals

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
