llm-benchmark
Community
Benchmark LLM prompts with integrity.
Data & Analytics
#llm #benchmark #experimentation #prompt-optimization #statistical-analysis #data-leakage #per-turn-metrics
Author: andreasronge
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It helps you design rigorous, statistically powered benchmarks for evaluating LLM prompts and configurations, so that experiments produce actionable, unbiased conclusions.
Core Features & Use Cases
- Rigorous experimental design templates for policy vs mechanism benchmarks
- Data leakage prevention guidance to keep prompts domain-blind and test data separate
- Per-turn metrics and statistical analysis guidance to interpret results across models and tasks
- Use cases include prompt ablation studies, A/B tests of prompts, and the design of held-out test suites
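As a rough illustration of what "statistically powered" can mean for an A/B test of two prompts, the sketch below estimates how many test cases each prompt needs in order to detect a given difference in pass rate. It uses the standard normal-approximation formula for comparing two proportions. The function name `samples_per_arm` and its defaults are hypothetical and are not part of the skill itself.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_baseline, p_variant, alpha=0.05, power=0.8):
    """Approximate test cases needed per prompt for a two-sided
    two-proportion z-test (normal approximation, equal group sizes).

    Hypothetical helper for illustration; not part of the skill.
    """
    if not (0 < p_baseline < 1 and 0 < p_variant < 1):
        raise ValueError("pass rates must be strictly between 0 and 1")
    if p_baseline == p_variant:
        raise ValueError("pass rates must differ to define an effect size")

    # Critical values for the significance level and the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Pooled variance term (under the null) and unpooled term (under the alternative).
    p_bar = (p_baseline + p_variant) / 2
    null_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * math.sqrt(
        p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    )

    return math.ceil(((null_term + alt_term) / (p_baseline - p_variant)) ** 2)


if __name__ == "__main__":
    # Example: baseline prompt passes 60% of cases, variant is hoped to reach 70%.
    print(samples_per_arm(0.60, 0.70))
```

Smaller expected differences need many more cases per prompt, which is one reason a held-out test suite may need to be larger than intuition suggests.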
Quick Start
Run a baseline benchmark on your prompts and report observed pass rates, per-turn metrics, and recommended sample sizes.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: llm-benchmark
Download link: https://github.com/andreasronge/ptc_runner/archive/main.zip#llm-benchmark
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.