enhance-evals

Community

Make recommendation evals more reliable

Author: stopitdan
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Improve the boredgame.lol recommendation evaluation system by addressing brittle pass/fail rules, incorrect idealGames, limited metrics, weak LLM judging, sparse coverage, and low statistical rigor so that eval results better reflect true recommendation quality.

Core Features & Use Cases

  • Case validation & repair: Find and fix wrong or non-existent idealGames in evals/cases.json, validate generated cases against the game catalog, and regenerate cases from the appropriate generators.
  • Metric and reporting improvements: Add serendipity, familiarity-balance, catalog-coverage, and confidence-interval calculations; update runner reporting and summary views for trend and per-game analysis.
  • LLM judge enhancements: Replace single holistic scores with per-dimension 0-2 ratings, require chain-of-thought explanations, and optionally support pairwise comparisons for A/B judge evaluations.
  • Operational workflow: Validate changes with the provided validation script, always regenerate cases.json after generator edits, run quick evals to detect regressions, and document work in evals/EVAL-WORKLOG.md.
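The case validation step above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual validation script: the `EvalCase` shape, the `findInvalidIdealGames` helper, and the case-insensitive name matching are all assumptions, since the real schema of evals/cases.json and the game catalog is not shown here.

```typescript
// Hypothetical shapes: the real schema of evals/cases.json and the catalog may differ.
interface EvalCase {
  id: string;
  idealGames: string[];
}

interface InvalidCase {
  caseId: string;
  missingGames: string[];
}

// Assumption: games are matched by name, ignoring case and surrounding whitespace.
function normalizeName(name: string): string {
  return name.trim().toLowerCase();
}

// Return every case whose idealGames include a name that is not in the catalog.
function findInvalidIdealGames(
  cases: readonly EvalCase[],
  catalogNames: readonly string[],
): InvalidCase[] {
  const catalog = new Set<string>(catalogNames.map(normalizeName));
  const invalid: InvalidCase[] = [];
  for (const evalCase of cases) {
    const missingGames = evalCase.idealGames.filter(
      (game) => !catalog.has(normalizeName(game)),
    );
    if (missingGames.length > 0) {
      invalid.push({ caseId: evalCase.id, missingGames });
    }
  }
  return invalid;
}
```

A check like this reports which cases need repair without modifying them, so the fix itself can still go through the appropriate generator and a regeneration of cases.json.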

Quick Start

Improve the eval system by fixing incorrect idealGames in evals/cases.json, adding serendipity and familiarity metrics to evals/metrics.ts, and updating evals/llm-judge.ts to use per-dimension scoring and chain-of-thought.
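As one possible shape for the per-dimension judge output mentioned above, the sketch below parses a judge reply in which each dimension carries a written explanation and a 0-2 rating. The dimension names passed in, the JSON layout, and the helpers `parseJudgeResult` and `normalizedScore` are hypothetical; the actual contents of evals/llm-judge.ts are not shown here.

```typescript
// Hypothetical reply layout: { "<dimension>": { "reasoning": "...", "rating": 0 | 1 | 2 } }
type Rating = 0 | 1 | 2;

interface DimensionJudgment {
  reasoning: string; // the judge's explanation for this dimension
  rating: Rating;
}

type JudgeResult = Record<string, DimensionJudgment>;

function isRating(value: unknown): value is Rating {
  return value === 0 || value === 1 || value === 2;
}

// Parse the judge's raw JSON reply; reject missing reasoning or ratings outside 0-2.
function parseJudgeResult(raw: string, dimensions: readonly string[]): JudgeResult {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Judge reply is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const result: JudgeResult = {};
  for (const dim of dimensions) {
    const entry = record[dim] as { reasoning?: unknown; rating?: unknown } | undefined;
    if (!entry || typeof entry.reasoning !== "string" || entry.reasoning.trim() === "") {
      throw new Error(`Missing reasoning for dimension "${dim}"`);
    }
    if (!isRating(entry.rating)) {
      throw new Error(`Rating for "${dim}" must be 0, 1, or 2`);
    }
    result[dim] = { reasoning: entry.reasoning, rating: entry.rating };
  }
  return result;
}

// Normalize to the range 0..1 so scores stay comparable if dimensions change.
function normalizedScore(result: JudgeResult): number {
  let total = 0;
  let count = 0;
  for (const key of Object.keys(result)) {
    const judgment = result[key];
    if (judgment === undefined) continue;
    total += judgment.rating;
    count += 1;
  }
  return count === 0 ? 0 : total / (2 * count);
}
```

Rejecting malformed replies outright, rather than defaulting a missing rating to 0, keeps judge failures visible in the eval report instead of silently lowering scores.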

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: enhance-evals
Download link: https://github.com/stopitdan/recommendagame/archive/main.zip#enhance-evals

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
