model-eval-benchmark

Community

Benchmark LLMs with repeatable evals.

Author: cirne
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Runs the brain-app JSONL agent evals (Enron v1, wiki v1) and compares model runs on pass rate, wall time, token usage, and estimated cost, all read from the report JSON. Use it when benchmarking LLMs, comparing providers, analyzing price/performance, or after changing supported-llm-models.json or LLM_PROVIDER and LLM_MODEL.
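The comparison step can be pictured with a minimal Python sketch. The report schema here is an assumption made only for illustration: the field names `model`, `est_cost_usd`, `tasks`, `passed`, `wall_time_s`, `input_tokens`, and `output_tokens`, and the helpers `summarize`, `load_report`, and `rank`, are hypothetical and may not match what brain-app actually writes.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunSummary:
    model: str
    pass_rate: float
    wall_time_s: float
    total_tokens: int
    est_cost_usd: float


def summarize(report: dict) -> RunSummary:
    """Reduce one report dict (hypothetical schema) to the compared metrics."""
    tasks = report["tasks"]
    passed = sum(1 for t in tasks if t["passed"])
    return RunSummary(
        model=report["model"],
        # Guard against an empty task list to avoid dividing by zero.
        pass_rate=passed / len(tasks) if tasks else 0.0,
        wall_time_s=sum(t["wall_time_s"] for t in tasks),
        total_tokens=sum(t["input_tokens"] + t["output_tokens"] for t in tasks),
        est_cost_usd=report["est_cost_usd"],
    )


def load_report(path: str) -> dict:
    """Read one report JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def rank(summaries: list[RunSummary]) -> list[RunSummary]:
    """Order runs by pass rate (highest first), breaking ties on lower cost."""
    return sorted(summaries, key=lambda s: (-s.pass_rate, s.est_cost_usd))
```

Ranking on pass rate first and cost second is one possible choice; a price/performance analysis might instead sort on cost per passed task.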

Core Features & Use Cases

  • Consistent eval harness across models with identical tasks and environments.
  • Detailed metrics: pass rate, wall time, tokens, and cost from aggregated usage.
  • Use Case: Benchmark new models or providers to decide on deployments or pricing.
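As a rough illustration of "cost from aggregated usage", the sketch below sums per-task token counts and prices them per million tokens. The `Usage` type, the function names, and the prices used in the example are hypothetical; real per-model prices would presumably come from a file such as supported-llm-models.json, whose format is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


def aggregate(usages: Iterable[Usage]) -> Usage:
    """Sum token counts across all tasks in a run."""
    usages = list(usages)
    return Usage(
        input_tokens=sum(u.input_tokens for u in usages),
        output_tokens=sum(u.output_tokens for u in usages),
    )


def estimate_cost_usd(
    usage: Usage,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Price aggregated usage with separate input and output rates (USD per million tokens)."""
    return (
        usage.input_tokens * input_usd_per_mtok
        + usage.output_tokens * output_usd_per_mtok
    ) / 1_000_000


# Example with made-up prices: 2 USD per million input tokens, 8 USD per million output tokens.
total = aggregate([Usage(600_000, 200_000), Usage(400_000, 300_000)])
cost = estimate_cost_usd(total, input_usd_per_mtok=2.0, output_usd_per_mtok=8.0)
```

Keeping input and output rates separate matters because providers commonly price the two differently, so a single blended rate can misstate the cost of output-heavy agent runs.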

Quick Start

Set LLM_PROVIDER and LLM_MODEL to your chosen provider and model, then run the full evaluation to generate a comparison report.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: model-eval-benchmark
Download link: https://github.com/cirne/brain-app/archive/main.zip#model-eval-benchmark

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
