rundale-bench

Name: rundale-bench
Availability: InStock
Author: dmooney

Community

Benchmark Rundale model quality end-to-end.

Data & Analytics #A/B testing #benchmarking #performance metrics #model evaluation #cost accounting #rubric scoring #LLM judging

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

It takes the guesswork out of comparing and evaluating Rundale model outputs by providing repeatable, rubric-scored quality benchmarks with cost and performance reporting.

Core Features & Use Cases

Targeted quality evaluation (bench mode): Runs all relevant slices (dialogue, intent, reaction, tier2-sim, tier3-sim, gaeilge) plus a perf probe for a single model+provider target, scores the outcomes with Sonnet 4.6 judges, and ingests the results into a final report.
Queue-based judging (drain-queue mode): Resumes or completes evaluation by draining already-produced judging bundles from an on-disk queue without re-running generation work.
Dialogue-only blind A/B/N (eval-dialogue mode): Generates dialogue samples with Opus, scores the dialogue candidates on a 5-axis rubric with a Sonnet 4.6 judge, and reports per-candidate cost.
Practical failure handling: Flags provider quirks (e.g., high error rates) and supports re-dispatching missing or failed judge bundles before finalizing aggregates.
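
To make the drain-queue and re-dispatch behaviour above concrete, here is a minimal Python sketch. It is not the skill's actual implementation: the file layout (`<id>.bundle.json` for a pending bundle, `<id>.result.json` for a finished verdict), the `drain_queue` function, the `DrainResult` type, and the `fake_judge` stand-in are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DrainResult:
    scored: dict = field(default_factory=dict)      # bundle id -> judge verdict
    redispatch: list = field(default_factory=list)  # bundle ids still to judge


def drain_queue(queue_dir: Path, judge) -> DrainResult:
    """Judge every pending bundle in queue_dir without regenerating anything.

    Assumed (hypothetical) layout: <id>.bundle.json holds generated output
    awaiting a judge; <id>.result.json is written once judging succeeds.
    """
    result = DrainResult()
    for bundle_path in sorted(queue_dir.glob("*.bundle.json")):
        bundle_id = bundle_path.name.removesuffix(".bundle.json")
        result_path = queue_dir / f"{bundle_id}.result.json"
        if result_path.exists():
            # Judged in an earlier run: reuse the stored verdict.
            result.scored[bundle_id] = json.loads(result_path.read_text())
            continue
        try:
            verdict = judge(json.loads(bundle_path.read_text()))
        except Exception:
            # Judge call failed: leave the bundle queued and flag it.
            result.redispatch.append(bundle_id)
            continue
        result_path.write_text(json.dumps(verdict))
        result.scored[bundle_id] = verdict
    return result


def fake_judge(bundle):
    # Stand-in for a real judge call; fails on one slice to show flagging.
    if bundle["slice"] == "intent":
        raise RuntimeError("provider error")
    return {"slice": bundle["slice"], "score": 4}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        queue = Path(tmp)
        for name in ("dialogue", "intent"):
            (queue / f"{name}.bundle.json").write_text(json.dumps({"slice": name}))
        outcome = drain_queue(queue, fake_judge)
        print("scored:", outcome.scored)
        print("re-dispatch:", outcome.redispatch)
```

Because finished verdicts are persisted next to their bundles, a second call to `drain_queue` only judges the bundles listed in `redispatch`, which is the property that lets a run be resumed before aggregates are finalized.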

Quick Start

To benchmark a single target, invoke the skill by saying: evaluate <model@base_url#env:VAR> on <provider>.
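
The target string packs three pieces into one token. As a rough illustration of how `model@base_url#env:VAR` could be split, the Python sketch below parses it into its parts. The `Target` type and `parse_target` function are hypothetical, the model name and URL are made up, and reading `VAR` as the name of an environment variable (for example, one holding an API key) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    model: str
    base_url: str
    env_var: str  # assumed: name of an environment variable, not its value


def parse_target(spec: str) -> Target:
    """Split 'model@base_url#env:VAR' into its three parts."""
    # The model name ends at the first '@'; the rest is URL plus env marker.
    model, at, rest = spec.partition("@")
    # Split on the last '#env:' so a '#' inside the URL does not confuse us.
    base_url, marker, env_var = rest.rpartition("#env:")
    if not (model and at and base_url and marker and env_var):
        raise ValueError(f"expected model@base_url#env:VAR, got {spec!r}")
    return Target(model=model, base_url=base_url, env_var=env_var)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(parse_target("example-model@https://api.example.com/v1#env:EXAMPLE_KEY"))
```

Validating the string up front means a malformed target fails before any generation or judging work starts.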

