agentsop-metric-design
Community
Design bias-resistant evaluation metrics.
Author: agentsope
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It prevents LLM optimization from gaming unreliable or biased “LLM-as-judge” scores by turning evaluation criteria into a decomposed, calibrated metric that reflects what humans actually want.
Core Features & Use Cases
- Decomposed multi-criteria metrics: break holistic judging into 3–6 orthogonal yes/no sub-judges (e.g., factuality, on-topic, non-hedging) to reduce judge noise and axis collapse.
- Compile vs eval semantics: return a strict `bool` during compilation (e.g., optimizer search) and a `float` during evaluation/reporting for stable optimization behavior.
- Deterministic length penalty: apply an explicit scalar length control so verbosity bias cannot “win” the optimization.
- Judge-bias mitigation and probes: run bias-probe checks (length, self-preference, position, rubric-order) and calibrate against humans before any compile.
- RAG evaluation triad support: align metrics to Faithfulness + Relevancy + Retriever quality so systems don’t look faithful while still being wrong or irrelevant.
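The first three features above can be sketched together in a few lines of Python. This is a minimal illustration, not the skill's generated output: the `DecomposedMetric` class, the keyword-based sub-judges (stand-ins for real LLM judge calls), the dict-shaped `example`/`pred`, and the `max_words` and `penalty_per_word` values are all hypothetical. The `trace is not None` check follows the convention DSPy metrics commonly use to detect compilation; whether the skill's metric uses exactly this signal is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical sub-judge type: (question, answer) -> yes/no verdict.
SubJudge = Callable[[str, str], bool]


@dataclass
class DecomposedMetric:
    judges: Dict[str, SubJudge]
    max_words: int = 120             # assumed length budget
    penalty_per_word: float = 0.005  # assumed penalty per word over budget

    def length_penalty(self, answer: str) -> float:
        """Deterministic penalty in [0, 1] for words beyond the budget."""
        extra = max(0, len(answer.split()) - self.max_words)
        return min(1.0, extra * self.penalty_per_word)

    def __call__(self, example, pred, trace=None):
        question, answer = example["question"], pred["answer"]
        verdicts = [judge(question, answer) for judge in self.judges.values()]
        penalty = self.length_penalty(answer)
        if trace is not None:
            # Compile mode: strict bool, every sub-judge passes and no overrun.
            return all(verdicts) and penalty == 0.0
        # Eval mode: fraction of passed sub-judges minus the length penalty.
        return max(0.0, sum(verdicts) / len(verdicts) - penalty)


# Deterministic stand-ins for LLM sub-judges (illustration only).
def factual(q: str, a: str) -> bool:
    return "paris" in a.lower()

def on_topic(q: str, a: str) -> bool:
    return "france" in a.lower() or "paris" in a.lower()

def non_hedging(q: str, a: str) -> bool:
    return not any(h in a.lower() for h in ("maybe", "possibly", "might"))


metric = DecomposedMetric(
    judges={"factuality": factual, "on_topic": on_topic, "non_hedging": non_hedging},
    max_words=10,
    penalty_per_word=0.1,
)
example = {"question": "What is the capital of France?"}
good = {"answer": "Paris is the capital of France."}
hedged = {"answer": "It might possibly be Paris."}
verbose = {"answer": " ".join(["Paris"] * 15)}

print(metric(example, good), metric(example, good, trace=[]))
print(metric(example, hedged), metric(example, hedged, trace=[]))
print(metric(example, verbose), metric(example, verbose, trace=[]))
```

Because the length penalty is plain arithmetic on the word count, a padded answer cannot raise its score by appealing to a judge's verbosity preference; in compile mode any overrun simply fails the example.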
Quick Start
Use agentsop-metric-design to generate a metric function and calibration receipt, then pass that metric into your DSPy optimizer or LlamaIndex eval loop to gate and regress changes quantitatively.
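Gating on calibration before handing the metric to an optimizer might look like the sketch below. It is an illustration under stated assumptions: `agreement_rate`, `length_bias_probe`, `ready_to_compile`, the padding sentence, and the thresholds are hypothetical names and values, and the format of the skill's actual calibration receipt is not shown here.

```python
from typing import Callable, Sequence


def agreement_rate(metric_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the metric's verdict matches the human label."""
    if len(metric_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(m == h for m, h in zip(metric_labels, human_labels))
    return matches / len(human_labels)


def length_bias_probe(
    score: Callable[[str], float],
    answer: str,
    padding: str = " To elaborate further, this is indeed the case.",
) -> float:
    """Score change from content-free padding; positive means verbosity is rewarded."""
    return score(answer + padding) - score(answer)


def ready_to_compile(
    agreement: float,
    probe_delta: float,
    min_agreement: float = 0.8,  # assumed threshold
    max_delta: float = 0.0,      # assumed: padding must never help
) -> bool:
    """Gate: compile only if the metric tracks humans and ignores padding."""
    return agreement >= min_agreement and probe_delta <= max_delta


# Toy scorers standing in for a real judge-backed metric (illustration only).
def length_biased(answer: str) -> float:
    return min(1.0, len(answer.split()) / 20)

def flat(answer: str) -> float:
    return 1.0


agreement = agreement_rate([True, False, True, True], [True, False, False, True])
biased_delta = length_bias_probe(length_biased, "Paris is the capital.")
flat_delta = length_bias_probe(flat, "Paris is the capital.")

print(agreement, biased_delta, flat_delta)
print(ready_to_compile(agreement, flat_delta))
print(ready_to_compile(0.9, flat_delta))
```

The same probe pattern extends to the other biases named above (self-preference, position, rubric-order): perturb only the suspected factor, rescore, and treat any systematic score shift as a reason not to compile yet.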
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: agentsop-metric-design
Download link: https://github.com/agentsope/SkillAlchemy/archive/main.zip#agentsop-metric-design
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.