rundale-bench
Community
Benchmark Rundale model quality end-to-end.
Data & Analytics
#A/B testing #benchmarking #performance metrics #model evaluation #cost accounting #rubric scoring #LLM judging
Author: dmooney
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It takes the guesswork out of comparing and evaluating Rundale model outputs by providing repeatable, rubric-scored quality benchmarks with cost and performance reporting.
Core Features & Use Cases
- Targeted quality evaluation (bench mode): Runs all relevant slices (dialogue, intent, reaction, tier2-sim, tier3-sim, gaeilge) plus a perf probe for a single model+provider target, scores the outcomes with Sonnet 4.6 judges, and then ingests the results into a final report.
- Queue-based judging (drain-queue mode): Resumes or completes evaluation by draining already-produced judging bundles from an on-disk queue without re-running generation work.
- Dialogue-only blind A/B/N (eval-dialogue mode): Generates dialogue samples with Opus, then uses a Sonnet 4.6 judge to score the dialogue candidates on a 5-axis rubric and reports per-candidate cost.
- Practical failure handling: Flags provider quirks (e.g., high error rates) and supports re-dispatching missing or failed judge bundles before finalizing aggregates.
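The drain-queue and re-dispatch behavior described above can be pictured with a small sketch. This is not the skill's actual implementation: the file naming (`*.bundle.json`, `*.result.json`), the `judge` callable, and the `max_attempts` parameter are all hypothetical, chosen only to illustrate draining an on-disk queue without repeating finished work.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def drain_queue(
    queue_dir: Path,
    judge: Callable[[dict], dict],  # hypothetical: takes a bundle, returns a judged result
    max_attempts: int = 2,          # hypothetical retry budget per bundle
) -> Dict[str, List[str]]:
    """Judge every pending bundle in queue_dir; never re-judge finished ones."""
    summary: Dict[str, List[str]] = {"judged": [], "skipped": [], "failed": []}
    suffix = ".bundle.json"  # assumed naming convention, not from the skill itself

    for bundle_path in sorted(queue_dir.glob(f"*{suffix}")):
        name = bundle_path.name[: -len(suffix)]
        result_path = queue_dir / f"{name}.result.json"

        # A result file on disk means this bundle was already judged.
        if result_path.exists():
            summary["skipped"].append(name)
            continue

        bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
        for _ in range(max_attempts):
            try:
                result = judge(bundle)
            except Exception:
                continue  # re-dispatch the same bundle on the next attempt
            result_path.write_text(json.dumps(result), encoding="utf-8")
            summary["judged"].append(name)
            break
        else:
            # Every attempt failed; leave the bundle queued for a later drain.
            summary["failed"].append(name)

    return summary
```

Because a result file marks a bundle as done, calling `drain_queue` again after an interruption picks up only what is still missing, and any names left in `summary["failed"]` indicate that aggregates should not be finalized yet.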
Quick Start
To benchmark a single target, invoke the skill by saying: evaluate <model@base_url#env:VAR> on <provider>.
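To make the shape of the target string concrete, here is a minimal parsing sketch. It assumes the `#env:VAR` suffix names an environment variable (presumably one holding an API key) and that the suffix is optional; neither point is spelled out in the listing, and `Target` and `parse_target` are hypothetical names rather than part of the skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    model: str
    base_url: str
    key_env_var: Optional[str]  # assumed meaning: name of an env var, not the secret itself


def parse_target(spec: str) -> Target:
    """Split 'model@base_url#env:VAR' into its three parts."""
    # Peel off the '#env:VAR' suffix first so ':' and '/' in the URL are untouched.
    body, has_fragment, fragment = spec.partition("#")
    key_env_var = None
    if has_fragment:
        prefix = "env:"
        if not fragment.startswith(prefix) or len(fragment) == len(prefix):
            raise ValueError(f"expected '#env:VAR' suffix, got '#{fragment}'")
        key_env_var = fragment[len(prefix):]

    # The first '@' separates the model name from the base URL.
    model, has_at, base_url = body.partition("@")
    if not has_at or not model or not base_url:
        raise ValueError(f"expected 'model@base_url', got '{body}'")

    return Target(model=model, base_url=base_url, key_env_var=key_env_var)
```

For example, `parse_target("my-model@http://localhost:8080/v1#env:MY_API_KEY")` yields a `Target` whose `model` is `my-model`, whose `base_url` is `http://localhost:8080/v1`, and whose `key_env_var` is `MY_API_KEY`. The model name, URL, and variable name here are placeholders.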
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: rundale-bench Download link: https://github.com/dmooney/Rundale/archive/main.zip#rundale-bench Please download this .zip file, extract it, and install it in the .claude/skills/ directory.