make-eval

Name: make-eval
Availability: InStock
Author: DevOtts

Community

Prove your LLM classifier is reliable.

Software Engineering #regression testing #guardrails #LangSmith #confusion matrix #llm evals #classifier testing #exact match

AuthorDevOtts

Version1.0.0

Installs0

System Documentation

What problem does it solve?

LLM-backed classifiers can silently regress, misroute, or fail guardrails, and teams often only discover that after users are impacted. This Skill builds a small, deterministic evaluation harness that measures exact-match accuracy and pinpoints dangerous misclassifications with an auditable confusion matrix.

Core Features & Use Cases

Closed-label exact-match scoring: scores predictions with strict got === expected for safer, more deterministic validation.
Adversarial dataset guidance: helps you create a compact dataset covering happy paths, empty/garbage inputs, prompt injection, ambiguous cases, contradictions, non-English, and the single most dangerous failure mode.
CI-friendly regression gate: prints a confusion matrix and exits non-zero when accuracy falls below a threshold or any DANGEROUS_* category is missed.
Optional LangSmith mode: when LangSmith is wired, it generates upload/run scripts, wires pnpm eval workflows, and reports experiment + dataset links on every run.

Quick Start

Ask an AI agent to generate the eval harness for your classifier by providing your closed label set and the boundary function that returns the predicted label, then run it with your dataset to get an exact-match confusion matrix.

make-eval

System Documentation

What problem does it solve?

Core Features & Use Cases

Quick Start

Dependency Matrix

Required Modules

Components

💻 Claude Code Installation

Agent Skills Search Helper