Name: model-eval-benchmark
Author: cirne

System Documentation

What problem does it solve?

Runs the brain-app JSONL agent evals (Enron v1, wiki v1) and compares model runs on pass rate, wall time, token usage, and estimated cost, all read from the report JSON. Use it when benchmarking LLMs, comparing providers, analyzing price/performance, or after changing supported-llm-models.json or LLM_PROVIDER and LLM_MODEL.

Core Features & Use Cases

Consistent eval harness: every model runs the same tasks in the same environment.
Detailed metrics: pass rate, wall time, tokens, and cost estimated from aggregated usage.
Use Case: Benchmark new models or providers to inform deployment or pricing decisions.
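
The comparison described above can be illustrated with a minimal Python sketch. It is not the skill's actual implementation: the report schema is not documented here, so the field names (model, cases, passed, wall_time_s, input_tokens, output_tokens, est_cost_usd), the sample values, and the helper names summarize and rank are all assumptions made for illustration. Wall time is summed per case, which assumes cases run sequentially.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    model: str
    pass_rate: float
    wall_time_s: float
    total_tokens: int
    est_cost_usd: float


def summarize(report: dict) -> RunSummary:
    """Reduce one report dict (hypothetical schema) to comparable metrics."""
    cases = report["cases"]
    passed = sum(1 for c in cases if c["passed"])
    return RunSummary(
        model=report["model"],
        pass_rate=passed / len(cases) if cases else 0.0,
        # Assumes sequential execution, so per-case times add up.
        wall_time_s=sum(c["wall_time_s"] for c in cases),
        total_tokens=sum(c["input_tokens"] + c["output_tokens"] for c in cases),
        est_cost_usd=sum(c["est_cost_usd"] for c in cases),
    )


def rank(reports: list[dict]) -> list[RunSummary]:
    """Sort by pass rate (highest first), then by estimated cost (lowest first)."""
    return sorted(
        (summarize(r) for r in reports),
        key=lambda s: (-s.pass_rate, s.est_cost_usd),
    )


# Made-up sample data in the assumed schema; not real benchmark results.
SAMPLE_REPORTS = [
    {
        "model": "model-a",
        "cases": [
            {"passed": True, "wall_time_s": 2.0, "input_tokens": 100,
             "output_tokens": 50, "est_cost_usd": 0.01},
            {"passed": False, "wall_time_s": 3.0, "input_tokens": 120,
             "output_tokens": 60, "est_cost_usd": 0.02},
        ],
    },
    {
        "model": "model-b",
        "cases": [
            {"passed": True, "wall_time_s": 1.5, "input_tokens": 80,
             "output_tokens": 40, "est_cost_usd": 0.005},
            {"passed": True, "wall_time_s": 2.5, "input_tokens": 90,
             "output_tokens": 45, "est_cost_usd": 0.005},
        ],
    },
]

if __name__ == "__main__":
    for s in rank(SAMPLE_REPORTS):
        print(f"{s.model}: pass={s.pass_rate:.0%} time={s.wall_time_s:.1f}s "
              f"tokens={s.total_tokens} cost=${s.est_cost_usd:.3f}")
```

Ranking on pass rate first and cost second is only one possible ordering; a price/performance analysis might instead divide estimated cost by the number of passed cases.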

Quick Start

Run the full evaluation with your chosen provider and model to generate a comparison report.

Please help me install this Skill.
Name: model-eval-benchmark
Download link: https://github.com/cirne/brain-app/archive/main.zip#model-eval-benchmark
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
