enhance-evals
Community
Make recommendation evals more reliable
Author: stopitdan
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
The boredgame.lol recommendation evaluation system suffers from brittle pass/fail rules, incorrect idealGames entries, limited metrics, weak LLM judging, sparse case coverage, and low statistical rigor. This skill addresses each of these so that eval results better reflect true recommendation quality.
Core Features & Use Cases
- Case validation & repair: Find and fix wrong or non-existent idealGames in evals/cases.json, validate generated cases against the game catalog, and regenerate cases from the appropriate generators.
- Metric and reporting improvements: Add serendipity, familiarity-balance, catalog-coverage, and confidence-interval calculations; update runner reporting and summary views for trend and per-game analysis.
- LLM judge enhancements: Replace single holistic scores with per-dimension 0-2 ratings, require chain-of-thought explanations, and optionally support pairwise comparisons for A/B judge evaluations.
- Operational workflow: Validate changes with the provided validation script, always regenerate cases.json after generator edits, run quick evals to detect regressions, and document work in evals/EVAL-WORKLOG.md.
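The case-validation and confidence-interval items above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the `EvalCase` shape, the function names, and the idea of matching games by normalized name are all assumptions, and the real evals/cases.json and catalog formats may differ. The interval shown is a Wilson score interval, chosen here only as one reasonable option for a pass rate over a small number of cases.

```typescript
// Hypothetical shape: the real evals/cases.json entries may carry other fields.
interface EvalCase {
  id: string;
  idealGames: string[];
}

interface CaseIssue {
  caseId: string;
  missingGames: string[];
}

// Normalize names so that "Catan " and "catan" compare equal.
function normalizeName(name: string): string {
  return name.trim().toLowerCase();
}

// Return one issue per case whose idealGames include names absent from the catalog.
function findInvalidIdealGames(cases: EvalCase[], catalogNames: string[]): CaseIssue[] {
  const known = new Set(catalogNames.map(normalizeName));
  const issues: CaseIssue[] = [];
  for (const c of cases) {
    const missingGames = c.idealGames.filter((g) => !known.has(normalizeName(g)));
    if (missingGames.length > 0) {
      issues.push({ caseId: c.id, missingGames });
    }
  }
  return issues;
}

// Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage).
function wilsonInterval(
  passes: number,
  total: number,
  z: number = 1.96
): { low: number; high: number } {
  if (total === 0) return { low: 0, high: 1 }; // no data: the interval is uninformative
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { low: Math.max(0, center - margin), high: Math.min(1, center + margin) };
}
```

Reporting the interval next to the raw pass rate makes it easier to tell a real regression from noise when only a handful of cases are run.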
Quick Start
Improve the eval system by fixing incorrect idealGames in evals/cases.json, adding serendipity and familiarity metrics to evals/metrics.ts, and updating evals/llm-judge.ts to use per-dimension scoring and chain-of-thought.
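As a rough illustration of per-dimension scoring, the sketch below parses a judge reply that carries a written explanation plus a 0-2 rating per dimension, then collapses the ratings into a single 0-1 score. The dimension names (`relevance`, `constraintFit`, `variety`), the JSON reply format, and both function names are hypothetical; the skill does not specify them, and evals/llm-judge.ts may be organized quite differently.

```typescript
// Hypothetical dimension names; the skill does not say which dimensions the judge rates.
const DIMENSIONS = ["relevance", "constraintFit", "variety"] as const;
type Dimension = (typeof DIMENSIONS)[number];

interface JudgeVerdict {
  reasoning: string; // the judge's written explanation, required before any rating is accepted
  ratings: Record<Dimension, number>; // each rating is 0, 1, or 2
}

// Parse the judge's raw JSON reply and reject anything outside the 0-2 scale.
function parseVerdict(raw: string): JudgeVerdict {
  const data = JSON.parse(raw) as { reasoning?: unknown; ratings?: Record<string, unknown> };
  if (typeof data.reasoning !== "string" || data.reasoning.trim() === "") {
    throw new Error("Judge reply is missing its reasoning");
  }
  const ratings = {} as Record<Dimension, number>;
  for (const dim of DIMENSIONS) {
    const value = data.ratings?.[dim];
    if (typeof value !== "number" || (value !== 0 && value !== 1 && value !== 2)) {
      throw new Error(`Rating for "${dim}" must be 0, 1, or 2`);
    }
    ratings[dim] = value;
  }
  return { reasoning: data.reasoning, ratings };
}

// Collapse the ratings into a 0-1 score using an unweighted mean.
// Weighting some dimensions more heavily would be an equally valid design choice.
function normalizedScore(verdict: JudgeVerdict): number {
  let total = 0;
  for (const dim of DIMENSIONS) {
    total += verdict.ratings[dim];
  }
  return total / (2 * DIMENSIONS.length);
}
```

Keeping the ratings separate until the final step means a report can show which dimension dropped, rather than only that the overall score did.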
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: let Claude install the skill automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill: Name: enhance-evals Download link: https://github.com/stopitdan/recommendagame/archive/main.zip#enhance-evals Please download this .zip file, extract it, and install it in the .claude/skills/ directory.