agentsop-metric-design

Name: agentsop-metric-design
Availability: InStock
Author: agentsope

Community

Design bias-resistant evaluation metrics.

Data & Analytics #llm-as-judge #rag #calibration #dspy #bias-mitigation #optimizer #evaluation-metrics

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

It keeps LLM optimizers from gaming unreliable or biased “LLM-as-judge” scores by turning evaluation criteria into a decomposed, calibrated metric that reflects what humans actually want.

Core Features & Use Cases

Decomposed multi-criteria metrics: break holistic judging into 3–6 orthogonal yes/no sub-judges (e.g., factuality, on-topic, non-hedging) to reduce judge noise and axis collapse.
Compile vs eval semantics: return strict bool during compilation (e.g., optimizer search) and float during evaluation/reporting for stable optimization behavior.
Deterministic length penalty: apply an explicit scalar length control so verbosity bias cannot “win” the optimization.
Judge-bias mitigation and probes: run bias-probe checks (length, self-preference, position, rubric-order) and calibrate against humans before any compile.
RAG evaluation triad support: align metrics to Faithfulness + Relevancy + Retriever quality so systems don’t look faithful while still being wrong or irrelevant.
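The first three features above can be sketched together in one metric function. This is a minimal illustration, not the skill's actual output: the three sub-judges are keyword stubs standing in for LLM calls, and the names `SUB_JUDGES`, `MAX_WORDS`, and `length_penalty` are hypothetical. The `(example, pred, trace=None)` signature follows the DSPy metric convention, where `trace` is set during compilation and left as `None` during evaluation.

```python
import re
from types import SimpleNamespace
from typing import Callable, Dict

def _tokens(text: str) -> set:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

# Hypothetical sub-judges: each answers exactly one yes/no question.
# In a real metric each would be an LLM call; stubs keep the sketch runnable.
def is_factual(example, pred) -> bool:
    return example.answer.lower() in pred.answer.lower()

def is_on_topic(example, pred) -> bool:
    return bool(_tokens(example.question) & _tokens(pred.answer))

def is_non_hedging(example, pred) -> bool:
    return not (_tokens(pred.answer) & {"might", "possibly", "perhaps"})

SUB_JUDGES: Dict[str, Callable] = {
    "factuality": is_factual,
    "on_topic": is_on_topic,
    "non_hedging": is_non_hedging,
}

MAX_WORDS = 50  # assumed length budget; tune per task

def length_penalty(text: str, max_words: int = MAX_WORDS) -> float:
    """Deterministic scalar: 1.0 within budget, max_words / n beyond it."""
    n = len(text.split())
    return 1.0 if n <= max_words else max_words / n

def metric(example, pred, trace=None):
    verdicts = {name: judge(example, pred) for name, judge in SUB_JUDGES.items()}
    if trace is not None:
        # Compile mode: strict bool, every sub-judge passes and length is in budget.
        return all(verdicts.values()) and len(pred.answer.split()) <= MAX_WORDS
    # Eval mode: fraction of sub-judges passed, scaled by the length penalty.
    return sum(verdicts.values()) / len(verdicts) * length_penalty(pred.answer)

example = SimpleNamespace(question="What is the capital of France?", answer="Paris")
short = SimpleNamespace(answer="The capital of France is Paris.")
padded = SimpleNamespace(answer="The capital of France is Paris " + "indeed " * 94)

print(metric(example, short))             # float score for reporting
print(metric(example, padded))            # same verdicts, reduced by length penalty
print(metric(example, padded, trace=[]))  # strict bool for the optimizer
```

Because the penalty is a plain function of word count, a padded answer cannot recover its lost score by persuading a judge, which is the point of keeping length control outside the LLM.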

Quick Start

Use agentsop-metric-design to generate a metric function and calibration receipt, then pass that metric into your DSPy optimizer or LlamaIndex eval loop to gate and regress changes quantitatively.
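One way to picture the calibration step is as a gate that compares a sub-judge's verdicts with human labels on the same items and blocks compilation when agreement is too low. The sketch below is an assumption about what a "calibration receipt" could contain, not the skill's actual format: the `calibration_receipt` function, its fields, and the 0.6 kappa threshold are all hypothetical.

```python
from typing import Dict, Sequence

def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's base rate of "yes".
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined when both raters are constant; return 0.0
        # so that a gate built on it fails closed.
        return 0.0
    return (observed - expected) / (1 - expected)

def calibration_receipt(human: Sequence[bool], judge: Sequence[bool],
                        min_kappa: float = 0.6) -> Dict[str, object]:
    """Hypothetical receipt: summary statistics plus a pass/fail flag."""
    kappa = cohen_kappa(human, judge)
    agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
    return {"n": len(human), "agreement": agreement,
            "kappa": kappa, "passed": kappa >= min_kappa}

human = [True, True, False, False, True, False, True, False]
judge = [True, True, False, False, True, False, False, True]

receipt = calibration_receipt(human, judge)
print(receipt)
if not receipt["passed"]:
    print("Judge disagrees with humans too often; fix the rubric before compiling.")
```

Kappa is used here rather than raw agreement because a judge that says "yes" to everything can score high agreement on a skewed label set while carrying no signal.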
