os-eval-runner
Community
Automates safe, autonomous skill evaluation loops.
System Documentation
What problem does it solve?
Stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts. Use when the user says "evaluate this skill", "run autoresearch loop on", "optimize this skill", "run the eval loop", or when another agent proposes a change to an existing skill and needs empirical validation before applying it. Supports autonomous loop mode for iterative improvement and single-shot QA mode for validating one specific proposed change. Requires Python 3.8+ and a git repository.
Core Features & Use Cases
- Pure metric producer: eval_runner.py reads a target and evals.json and emits objective scores without side effects.
- Loop gate: evaluate.py enforces baseline, keeps or reverts changes, and appends results to the per-target ledger.
- Scaffold templates: init_autoresearch.py deploys standard program/evals/results templates into your experiment.
- Autonomous optimization: runs iterative mutations on a single-mutation target with KEEP/DISCARD decisions, or validates a specific change (QA mode).
Quick Start
Install the skill and initialize an autoresearch experiment with init_autoresearch.py, then baseline the target with evaluate.py and start the loop.
Dependency Matrix
Required Modules
None required
Components
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: os-eval-runner
Download link: https://github.com/richfrem/agent-plugins-skills/archive/main.zip#os-eval-runner
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.