agentsop-metric-design

Community

Design bias-resistant evaluation metrics.

Author: agentsope
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It keeps an LLM optimizer from gaming unreliable or biased “LLM-as-judge” scores by turning evaluation criteria into a decomposed, calibrated metric that reflects what humans actually want.

Core Features & Use Cases

  • Decomposed multi-criteria metrics: break holistic judging into 3–6 orthogonal yes/no sub-judges (e.g., factuality, on-topic, non-hedging) to reduce judge noise and axis collapse.
  • Compile vs eval semantics: return strict bool during compilation (e.g., optimizer search) and float during evaluation/reporting for stable optimization behavior.
  • Deterministic length penalty: apply an explicit scalar length control so verbosity bias cannot “win” the optimization.
  • Judge-bias mitigation and probes: run bias-probe checks (length, self-preference, position, rubric-order) and calibrate against humans before any compile.
  • RAG evaluation triad support: align metrics to Faithfulness + Relevancy + Retriever quality so systems don’t look faithful while still being wrong or irrelevant.
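As a rough illustration of the first three features, the sketch below combines yes/no sub-judges, a deterministic length penalty, and the compile-vs-eval return type in one metric. It is a minimal sketch, not the skill's actual output: `make_metric`, `length_penalty`, the dict-shaped `example`/`pred`, and the keyword-based toy judges are all hypothetical stand-ins (real sub-judges would be LLM calls). The `metric(example, pred, trace=None)` shape follows the DSPy convention in which `trace` is `None` during evaluation and set during compilation.

```python
from typing import Callable, Optional, Sequence, Union

# Hypothetical sub-judge type: (question, answer) -> yes/no verdict.
SubJudge = Callable[[str, str], bool]


def length_penalty(answer: str, max_words: int = 120) -> float:
    """Deterministic scalar in (0, 1]: 1.0 up to max_words, then max_words / n_words."""
    n_words = len(answer.split())
    if n_words <= max_words:
        return 1.0
    return max_words / n_words


def make_metric(judges: Sequence[SubJudge], max_words: int = 120,
                pass_threshold: float = 1.0):
    """Build a metric from independent yes/no sub-judges (assumed helper)."""

    def metric(example: dict, pred: dict,
               trace: Optional[object] = None) -> Union[bool, float]:
        answer = pred["answer"]
        votes = [judge(example["question"], answer) for judge in judges]
        # Fraction of sub-judges that said "yes", scaled by the length penalty.
        score = sum(votes) / len(votes) * length_penalty(answer, max_words)
        if trace is not None:
            # Compile mode: strict bool so the optimizer keeps only clean passes.
            return score >= pass_threshold
        # Eval mode: float for reporting.
        return score

    return metric


# Toy stand-ins for LLM sub-judges (assumptions for illustration only).
def on_topic(question: str, answer: str) -> bool:
    return any(word in answer.lower() for word in question.lower().split())


def non_hedging(question: str, answer: str) -> bool:
    return "it depends" not in answer.lower()


metric = make_metric([on_topic, non_hedging], max_words=10)
example = {"question": "capital of France"}
short_answer = {"answer": "Paris is the capital of France."}
long_answer = {"answer": "Paris is the capital of France. " * 4}

print(metric(example, short_answer))            # float in eval mode
print(metric(example, long_answer))             # lower: padded answer is penalized
print(metric(example, long_answer, trace=[]))   # bool in compile mode
```

Because the penalty is a plain function of word count, a padded answer cannot raise its score by persuading a judge; it can only lose points.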

Quick Start

Use agentsop-metric-design to generate a metric function and a calibration receipt, then pass that metric into your DSPy optimizer or LlamaIndex eval loop to gate changes and catch regressions quantitatively.
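One way to picture the gating step is a small loop that scores a baseline and a candidate on the same dev set and accepts the candidate only if it does not regress. This is a hedged sketch in plain Python, not the DSPy or LlamaIndex API: `evaluate`, `passes_gate`, `toy_metric`, and the two toy programs are hypothetical names, and in practice the metric would be the one generated by the skill.

```python
from statistics import mean


def evaluate(program, metric, devset) -> float:
    """Mean eval-mode score (trace=None) of `program` over `devset`."""
    return mean(float(metric(ex, program(ex), None)) for ex in devset)


def passes_gate(baseline, candidate, metric, devset, min_gain: float = 0.0):
    """Return (accepted, baseline_score, candidate_score)."""
    base = evaluate(baseline, metric, devset)
    cand = evaluate(candidate, metric, devset)
    return cand >= base + min_gain, base, cand


# Toy metric with the metric(example, pred, trace=None) shape (illustration only).
def toy_metric(example: dict, pred: dict, trace=None) -> float:
    return 1.0 if example["expected"] in pred["answer"] else 0.0


devset = [
    {"question": "2+2", "expected": "4"},
    {"question": "3+3", "expected": "6"},
]


def baseline_program(example: dict) -> dict:
    # Always answers "4", so it gets only the first item right.
    return {"answer": "4"}


def candidate_program(example: dict) -> dict:
    lookup = {"2+2": "4", "3+3": "6"}
    return {"answer": lookup[example["question"]]}


accepted, base_score, cand_score = passes_gate(
    baseline_program, candidate_program, toy_metric, devset
)
print(accepted, base_score, cand_score)
```

Setting `min_gain` above zero would require a margin of improvement before a change is accepted, which helps when the metric itself is noisy.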

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: agentsop-metric-design
Download link: https://github.com/agentsope/SkillAlchemy/archive/main.zip#agentsop-metric-design

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.