rundale-bench

Community

Benchmark Rundale model quality end-to-end.

Author: dmooney
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It takes the guesswork out of comparing and evaluating Rundale model outputs by providing repeatable, rubric-scored quality benchmarks with cost and performance reporting.

Core Features & Use Cases

  • Targeted quality evaluation (bench mode): Runs every relevant slice (dialogue, intent, reaction, tier2-sim, tier3-sim, gaeilge) plus a perf probe against a single model+provider target, scores the outcomes with Sonnet 4.6 judges, and ingests the results into a final report.
  • Queue-based judging (drain-queue mode): Resumes or completes evaluation by draining already-produced judging bundles from an on-disk queue without re-running generation work.
  • Dialogue-only blind A/B/N (eval-dialogue mode): Generates dialogue samples with Opus, scores each candidate on a 5-axis rubric with a Sonnet 4.6 judge, and reports per-candidate cost.
  • Practical failure handling: Flags provider quirks (e.g., high error rates) and supports re-dispatching missing or failed judge bundles before finalizing aggregates.
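
The drain-queue and re-dispatch behaviour described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the listing does not document the queue layout, so the `*.bundle.json` / `*.result.json` naming, the `drain` function, and the `judge` callable are all hypothetical and are not the skill's actual implementation.

```python
import json
from pathlib import Path
from typing import Callable, List


def result_path(bundle: Path) -> Path:
    # Hypothetical convention: "x.bundle.json" is judged into "x.result.json".
    return bundle.with_name(bundle.name.replace(".bundle.json", ".result.json"))


def pending_bundles(queue_dir: Path) -> List[Path]:
    """Return bundles that have no result file yet, in a stable order."""
    return sorted(
        b for b in queue_dir.glob("*.bundle.json") if not result_path(b).exists()
    )


def drain(queue_dir: Path, judge: Callable[[dict], dict]) -> List[Path]:
    """Judge every pending bundle without regenerating anything.

    Returns the bundles whose judging failed so the caller can
    re-dispatch them before computing aggregates.
    """
    failed: List[Path] = []
    for bundle in pending_bundles(queue_dir):
        try:
            verdict = judge(json.loads(bundle.read_text()))
            result_path(bundle).write_text(json.dumps(verdict))
        except Exception:
            # Broad catch is deliberate in this sketch: any failure
            # just leaves the bundle pending for a later pass.
            failed.append(bundle)
    return failed
```

Because a bundle counts as done only once its result file exists, calling `drain` a second time retries exactly the missing or failed bundles, which mirrors the "re-dispatch before finalizing aggregates" step.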

Quick Start

To benchmark a single target, ask the skill: evaluate <model@base_url#env:VAR> on <provider>.
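
To make the target placeholder concrete, here is a small sketch of how a string shaped like `model@base_url#env:VAR` could be split into its parts. The grammar is inferred only from the placeholder above; the `Target` class, the `parse_target` function, and the example model name and URL are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    model: str        # model identifier sent to the provider
    base_url: str     # API endpoint the model is served from
    api_key_env: str  # NAME of the environment variable holding the key


def parse_target(spec: str) -> Target:
    """Split 'model@base_url#env:VAR' into its three parts (assumed grammar)."""
    # Split on the last '#env:' so the suffix is unambiguous.
    head, sep, env_var = spec.rpartition("#env:")
    if not sep or not env_var:
        raise ValueError(f"missing '#env:VAR' suffix in target: {spec!r}")
    # Split on the first '@' so any later '@' stays inside the URL.
    model, sep, base_url = head.partition("@")
    if not sep or not model or not base_url:
        raise ValueError(f"expected 'model@base_url' before '#env:' in: {spec!r}")
    return Target(model=model, base_url=base_url, api_key_env=env_var)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(parse_target("my-model@https://api.example.com/v1#env:MY_API_KEY"))
```

Passing the name of an environment variable rather than the key itself keeps the secret out of the prompt and out of any saved report.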

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: rundale-bench
Download link: https://github.com/dmooney/Rundale/archive/main.zip#rundale-bench

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
