eval-harness-kit
Community
Build and run deterministic eval suites
Author: aufrank
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill automates the creation and execution of evaluation suites for agent workflows, so that results are reproducible and capability gains or regressions can be tracked across runs.
Core Features & Use Cases
- Manifest-driven Evals: Define tasks, inputs, and grading criteria in a JSON manifest.
- Deterministic & LLM Grading: Supports exact match, regex, JSON comparison, and optional LLM rubrics.
- Use Case: You've developed a new agent for summarizing documents. Use this Skill to create an evaluation suite with various documents and expected summary qualities, then run the agent against it to measure performance and identify regressions over time.
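The deterministic grading modes listed above (exact match, regex, JSON comparison) can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual implementation: the manifest field names (`tasks`, `id`, `grader`, `expected`), the grader functions, and `run_suite` are all hypothetical, and the real manifest schema may differ.

```python
import json
import re

# Hypothetical manifest: these field names are assumptions for illustration,
# not the schema that eval-harness-kit actually uses.
MANIFEST = {
    "tasks": [
        {"id": "t1", "grader": "exact", "expected": "4"},
        {"id": "t2", "grader": "regex", "expected": r"^\d{4}-\d{2}-\d{2}$"},
        {"id": "t3", "grader": "json", "expected": '{"name": "Ada", "id": 1}'},
    ]
}


def grade_exact(output: str, expected: str) -> bool:
    """Pass when the output equals the expected text, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def grade_regex(output: str, pattern: str) -> bool:
    """Pass when the pattern matches somewhere in the stripped output."""
    return re.search(pattern, output.strip()) is not None


def grade_json(output: str, expected: str) -> bool:
    """Pass when both strings parse to equal JSON values (key order is ignored)."""
    try:
        return json.loads(output) == json.loads(expected)
    except json.JSONDecodeError:
        return False


GRADERS = {"exact": grade_exact, "regex": grade_regex, "json": grade_json}


def run_suite(manifest: dict, outputs: dict) -> dict:
    """Grade each task's recorded output and return {task_id: passed}."""
    results = {}
    for task in manifest["tasks"]:
        grader = GRADERS[task["grader"]]
        # A missing output is graded as an empty string, which fails every task here.
        results[task["id"]] = grader(outputs.get(task["id"], ""), task["expected"])
    return results


# Example agent outputs keyed by task id (made up for this sketch).
OUTPUTS = {
    "t1": "4\n",
    "t2": "2024-05-01",
    "t3": '{"id": 1, "name": "Ada"}',
}

RESULTS = run_suite(MANIFEST, OUTPUTS)
print(RESULTS)
```

Because every grader here is a pure function of the output and the expected value, rerunning the suite on the same outputs gives the same verdicts, which is what makes run-to-run comparisons meaningful. An LLM rubric grader would not have this property without extra controls.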
Quick Start
Run the example manifest using:
python <CODEX_HOME>/skills/eval-harness-kit/scripts/run_eval.py --manifest <path> --run-id <id>
Dependency Matrix
Required Modules
None required
Components
scripts, references, assets
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: eval-harness-kit
Download link: https://github.com/aufrank/agent-skills/archive/main.zip#eval-harness-kit
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.