llm-benchmark

Community

Benchmark LLM prompts with integrity.

Author: andreasronge
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Design rigorous, statistically powered benchmarks to evaluate LLM prompts and configurations, ensuring experiments produce actionable, unbiased conclusions.

Core Features & Use Cases

  • Rigorous experimental design templates for policy vs mechanism benchmarks
  • Data leakage prevention guidance to keep prompts domain-blind and test data separate
  • Per-turn metrics and statistical analysis guidance to interpret results across models and tasks
  • Use cases include prompt ablation studies, A/B tests of prompts, and the design of held-out test suites
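As an illustration of the statistical analysis guidance mentioned above, a pass rate is better reported with an interval than as a bare percentage. The sketch below computes a Wilson score interval using only the Python standard library. The function name `wilson_interval` and its parameters are hypothetical and are not part of the llm-benchmark skill itself; treat this as one reasonable way to do it, not the skill's method.

```python
import math
from statistics import NormalDist


def wilson_interval(passes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    Hypothetical helper for illustration; not part of the llm-benchmark skill.
    """
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = passes / trials

    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(passes=70, trials=100)
    print(f"pass rate 0.70, 95% interval: [{low:.3f}, {high:.3f}]")
```

When two prompts are compared in an A/B test, heavily overlapping intervals are a hint that the observed difference may be noise and that more runs are needed before drawing a conclusion.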

Quick Start

Run a baseline benchmark on your prompts and report observed pass rates, per-turn metrics, and recommended sample sizes.
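To make "recommended sample sizes" concrete, the sketch below estimates how many runs per prompt are needed to detect a given difference in pass rate, using the standard normal-approximation formula for a two-sided two-proportion z-test. The function name `required_samples_per_arm` and the default `alpha` and `power` values are assumptions chosen for illustration; the skill may recommend a different procedure.

```python
import math
from statistics import NormalDist


def required_samples_per_arm(
    p_baseline: float,
    p_variant: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate runs needed per prompt to detect p_baseline vs p_variant.

    Hypothetical helper for illustration; not part of the llm-benchmark skill.
    Uses the normal approximation for a two-sided two-proportion z-test.
    """
    for p in (p_baseline, p_variant):
        if not 0 < p < 1:
            raise ValueError("pass rates must be strictly between 0 and 1")
    if p_baseline == p_variant:
        raise ValueError("pass rates must differ to compute a sample size")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power

    p_bar = (p_baseline + p_variant) / 2
    pooled_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled_term = z_beta * math.sqrt(
        p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    )
    n = (pooled_term + unpooled_term) ** 2 / (p_baseline - p_variant) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Example: baseline prompt passes 60% of the time; we want to detect 70%.
    print(required_samples_per_arm(0.60, 0.70))
```

Under these assumptions, telling a 60% pass rate apart from a 70% pass rate takes a few hundred runs per prompt, which is why small pilot runs rarely support firm conclusions about modest improvements.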

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: llm-benchmark
Download link: https://github.com/andreasronge/ptc_runner/archive/main.zip#llm-benchmark

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
