eval-harness

Community

Formal eval framework for Claude Code sessions

Authorflag3
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This Skill provides a structured framework to define, execute, and observe evaluations for Claude Code sessions, enabling repeatable, traceable improvements through eval-driven development.

Core Features & Use Cases

  • Capability Evals: formal tests that verify new features behave as intended.
  • Regression Evals: checks that existing functionality remains intact after changes.
  • Grader Options: supports Code-Based, Model-Based, and Human graders for flexible assessment.
  • Workflow & Storage: a repeatable workflow with defined storage for eval definitions and histories.

Quick Start

Define a new evaluation workflow for a Claude Code feature and run: /eval define <feature-name>

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: eval-harness
Download link: https://github.com/flag3/dotfiles/archive/main.zip#eval-harness

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.