make-eval

Community

Prove your LLM classifier is reliable.

Author: DevOtts
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

LLM-backed classifiers can silently regress, misroute inputs, or fail guardrails, and teams often discover the problem only after users are affected. This Skill builds a small, deterministic evaluation harness that measures exact-match accuracy and surfaces dangerous misclassifications in an auditable confusion matrix.

Core Features & Use Cases

  • Closed-label exact-match scoring: scores predictions with strict got === expected for safer, more deterministic validation.
  • Adversarial dataset guidance: helps you create a compact dataset covering happy paths, empty/garbage inputs, prompt injection, ambiguous cases, contradictions, non-English, and the single most dangerous failure mode.
  • CI-friendly regression gate: prints a confusion matrix and exits non-zero when accuracy falls below a threshold or any DANGEROUS_* category is missed.
  • Optional LangSmith mode: when LangSmith is configured, it generates upload/run scripts, wires pnpm eval workflows, and reports experiment and dataset links on every run.
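The scoring and gating behaviour described in the list above could look roughly like the following sketch. It is not the code the Skill generates: the `EvalCase` shape, the `evaluate` function, the report fields, and the reading of `DANGEROUS_*` as a prefix on a per-case `category` field are all assumptions made for illustration. The classifier is also shown as a synchronous function to keep the example short; a real LLM call would normally be asynchronous.

```typescript
// Illustrative sketch only: EvalCase, EvalReport, and evaluate() are
// hypothetical names, not the harness this Skill generates.
interface EvalCase {
  input: string;
  expected: string; // one label from the closed set
  category: string; // e.g. "HAPPY_PATH" or "DANGEROUS_PROMPT_INJECTION" (assumed naming)
}

interface EvalReport {
  accuracy: number;
  confusion: Record<string, Record<string, number>>; // confusion[expected][got] = count
  dangerousMisses: number;
  passed: boolean;
}

function evaluate(
  cases: EvalCase[],
  classify: (input: string) => string,
  minAccuracy: number,
): EvalReport {
  const confusion: Record<string, Record<string, number>> = {};
  let correct = 0;
  let dangerousMisses = 0;

  for (const c of cases) {
    const got = classify(c.input);

    // Count every (expected, got) pair so misroutes stay auditable.
    const row = confusion[c.expected] ?? (confusion[c.expected] = {});
    row[got] = (row[got] ?? 0) + 1;

    // Strict exact match: no fuzzy or partial credit.
    if (got === c.expected) {
      correct += 1;
    } else if (/^DANGEROUS_/.test(c.category)) {
      dangerousMisses += 1;
    }
  }

  // An empty dataset scores 0 so it cannot pass the gate by accident.
  const accuracy = cases.length === 0 ? 0 : correct / cases.length;
  const passed = accuracy >= minAccuracy && dangerousMisses === 0;
  return { accuracy, confusion, dangerousMisses, passed };
}
```

A CI wrapper around this sketch could print `confusion` and set a non-zero exit code whenever `passed` is false, which gives the regression gate described above.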

Quick Start

Ask an AI agent to generate the eval harness for your classifier by providing your closed label set and the boundary function that returns the predicted label, then run it with your dataset to get an exact-match confusion matrix.
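As a rough idea of what the two inputs might look like, here is a minimal sketch of a closed label set and a boundary function. The label names, `toLabel`, `classify`, and the `callModel` stub are all hypothetical; in a real project `callModel` would be replaced by your own LLM call, and the generated harness may expect a different shape.

```typescript
// Illustrative sketch only: every name below is hypothetical.

// The closed label set: the classifier may return these values and nothing else.
const LABELS = ["BILLING", "SUPPORT", "OTHER"] as const;
type Label = (typeof LABELS)[number];

// Type guard that narrows an arbitrary string to the closed set.
function isLabel(value: string): value is Label {
  return LABELS.some((label) => label === value);
}

// Raw model text in, closed label out. Only surrounding whitespace is
// tolerated; anything else outside the set is an error rather than a guess.
function toLabel(raw: string): Label {
  const candidate = raw.trim();
  if (!isLabel(candidate)) {
    throw new Error(`Label outside the closed set: ${JSON.stringify(raw)}`);
  }
  return candidate;
}

// Stand-in for a real LLM call so the sketch runs offline.
function callModel(input: string): string {
  return /invoice/i.test(input) ? " BILLING\n" : "SUPPORT";
}

// The boundary function an eval harness would call once per dataset row.
function classify(input: string): Label {
  return toLabel(callModel(input));
}
```

Rejecting out-of-set output at the boundary, instead of mapping it to a fallback label, keeps exact-match scoring honest: a malformed model response shows up as a failure rather than an accidental match.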

Dependency Matrix

Required Modules

langsmith

Components

scripts

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: make-eval
Download link: https://github.com/DevOtts/all-skills-you-will-ever-need/archive/main.zip#make-eval

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.