os-eval-runner

Community

Automates safe, autonomous skill evaluation loops.

Author: richfrem
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

os-eval-runner is a stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts. Use it when the user says "evaluate this skill", "run autoresearch loop on", "optimize this skill", or "run the eval loop", or when another agent proposes a change to an existing skill and needs empirical validation before applying it. It supports two modes: an autonomous loop mode for iterative improvement, and a single-shot QA mode for validating one specific proposed change. It requires Python 3.8+ and a git repository.

Core Features & Use Cases

  • Pure metric producer: eval_runner.py reads a target and evals.json and emits objective scores without side effects.
  • Loop gate: evaluate.py enforces a baseline, keeps or reverts each change, and appends the result to the per-target ledger.
  • Scaffold templates: init_autoresearch.py deploys standard program/evals/results templates into your experiment.
  • Autonomous optimization: iterates mutations on a single-mutation target, making a KEEP/DISCARD decision for each iteration, or validates one specific change (QA mode).
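
The keep-or-revert gate described above can be illustrated with a minimal sketch. This is not the code in evaluate.py: the function names gate_iteration and run_loop, the "strictly higher score wins" rule, and the JSON-lines ledger format are all assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Iterable


def gate_iteration(candidate: float, baseline: float,
                   ledger_path: Path, label: str) -> str:
    """Compare a candidate score to the baseline and log the decision.

    Hypothetical rule: KEEP only when the candidate strictly beats the
    baseline; anything else is a DISCARD (the change would be reverted).
    """
    decision = "KEEP" if candidate > baseline else "DISCARD"
    entry = {
        "label": label,
        "baseline": baseline,
        "candidate": candidate,
        "decision": decision,
    }
    # Append-only ledger, one JSON object per line (assumed format).
    with ledger_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return decision


def run_loop(baseline: float, candidate_scores: Iterable[float],
             ledger_path: Path) -> float:
    """Gate a sequence of candidate scores and return the final baseline.

    A kept candidate becomes the new baseline for the next iteration;
    a discarded one leaves the baseline unchanged.
    """
    for i, score in enumerate(candidate_scores):
        if gate_iteration(score, baseline, ledger_path, f"iter-{i}") == "KEEP":
            baseline = score
    return baseline
```

In this sketch the scores would come from a metric producer such as eval_runner.py, and a DISCARD would be paired with a revert of the working tree. Both steps are left out here because the source does not specify how they are invoked.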

Quick Start

Install the skill, initialize an autoresearch experiment with init_autoresearch.py, then baseline the target with evaluate.py and start the loop.

Dependency Matrix

Required Modules

None required

Components

scripts, references, assets

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: os-eval-runner
Download link: https://github.com/richfrem/agent-plugins-skills/archive/main.zip#os-eval-runner

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
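
For a manual install, the extraction step can be sketched in Python. This is a hedged illustration, not an official installer: the helper name extract_skill is hypothetical, and because the archive's internal layout is not documented here, the sketch assumes only that the skill lives in some directory named after the skill, and searches for it. It operates on an archive that has already been downloaded.

```python
import zipfile
from pathlib import Path, PurePosixPath
from typing import List


def extract_skill(archive, skill_name: str, dest_root: Path) -> List[Path]:
    """Copy files under any directory named `skill_name` into dest_root/skill_name.

    `archive` may be a path or a binary file-like object. The location of
    the skill directory inside the archive is an assumption: the first
    path component equal to `skill_name` is treated as the skill root.
    """
    target = Path(dest_root) / skill_name
    written: List[Path] = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts
            # Only keep files that sit inside a directory named skill_name.
            if skill_name not in parts[:-1]:
                continue
            rel = parts[parts.index(skill_name) + 1:]
            if ".." in rel:
                continue  # refuse entries that try to escape the target
            out = target.joinpath(*rel)
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(zf.read(info))
            written.append(out)
    return written
```

A call such as extract_skill("main.zip", "os-eval-runner", Path(".claude/skills")) would then place the files under .claude/skills/os-eval-runner/, assuming the archive was saved locally as main.zip.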
