eval-harness-kit

Community

Build and run deterministic eval suites

Author: aufrank
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates the creation and execution of evaluation suites for agent workflows. It keeps results reproducible across runs, so you can track both capability gains and regressions.

Core Features & Use Cases

  • Manifest-driven Evals: Define tasks, inputs, and grading criteria in a JSON manifest.
  • Deterministic & LLM Grading: Supports exact match, regex, JSON comparison, and optional LLM rubrics.
  • Use Case: You've developed a new agent for summarizing documents. Use this Skill to create an evaluation suite with various documents and expected summary qualities, then run the agent against it to measure performance and identify regressions over time.
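The manifest-driven, deterministic grading described above can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual implementation: the manifest field names (`tasks`, `id`, `grader`, `expected`) and the helper functions `grade`, `run_suite`, and `regressions` are all hypothetical, and the optional LLM rubric grader is left out.

```python
import json
import re

# Hypothetical manifest: these field names are assumptions for illustration,
# not the schema that eval-harness-kit actually uses.
MANIFEST = json.loads("""
{
  "tasks": [
    {"id": "t1", "grader": "exact", "expected": "42"},
    {"id": "t2", "grader": "regex", "expected": "^Summary: .+"},
    {"id": "t3", "grader": "json", "expected": {"status": "ok", "items": [1, 2]}}
  ]
}
""")


def grade(task, output):
    """Return True if the agent output satisfies the task's deterministic grader."""
    kind = task["grader"]
    expected = task["expected"]
    if kind == "exact":
        # Exact match, ignoring surrounding whitespace.
        return output.strip() == expected
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "json":
        # Compare parsed values so key order and spacing do not matter.
        try:
            return json.loads(output) == expected
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unknown grader: {kind}")


def run_suite(manifest, outputs):
    """Grade every task in the manifest against the output recorded for it."""
    return {task["id"]: grade(task, outputs[task["id"]]) for task in manifest["tasks"]}


def regressions(previous, current):
    """List task ids that passed in the previous run but do not pass now."""
    return sorted(
        task_id
        for task_id, passed in previous.items()
        if passed and not current.get(task_id, False)
    )


# Example agent outputs, keyed by task id.
outputs = {
    "t1": "42\n",
    "t2": "Summary: the report covers third-quarter revenue.",
    "t3": '{"items": [1, 2], "status": "ok"}',
}
results = run_suite(MANIFEST, outputs)
```

Because every grader here is a pure function of the manifest and the output, the same inputs always produce the same pass/fail map, which is what makes comparing two runs with `regressions` meaningful.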

Quick Start

Run the example manifest with the following command:

  python <CODEX_HOME>/skills/eval-harness-kit/scripts/run_eval.py --manifest <path> --run-id <id>
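If you want to launch the runner from another script, the command can be assembled as an argument list. The sketch below only builds the list and does not execute anything. The helper `build_command` and the sample values passed to it are hypothetical; only the script location under `skills/eval-harness-kit/scripts/` and the `--manifest` and `--run-id` flags come from the Quick Start command.

```python
from pathlib import PurePosixPath


def build_command(codex_home, manifest_path, run_id):
    """Assemble the Quick Start command as an argv list (hypothetical helper)."""
    script = (
        PurePosixPath(codex_home)
        / "skills"
        / "eval-harness-kit"
        / "scripts"
        / "run_eval.py"
    )
    return [
        "python",
        str(script),
        "--manifest",
        str(manifest_path),
        "--run-id",
        str(run_id),
    ]


# Placeholder values for illustration only; substitute your own paths and id.
cmd = build_command("/opt/codex", "manifests/example.json", "run-001")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single shell string avoids quoting problems when a path contains spaces.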

Dependency Matrix

Required Modules

None required

Components

scripts, references, assets

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: eval-harness-kit
Download link: https://github.com/aufrank/agent-skills/archive/main.zip#eval-harness-kit

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.