llm-evaluation
Master LLM evaluation for accurate, reliable AI apps.
System Documentation
What problem does it solve?
Systematically evaluating Large Language Models (LLMs) and their applications is crucial for ensuring performance, reliability, and safety, but it's a complex task. This Skill provides a comprehensive guide to automated metrics, human feedback, and LLM-as-Judge techniques.
Core Features & Use Cases
- Automated Metrics: Covers BLEU, ROUGE, and BERTScore for text generation, and MRR and NDCG for retrieval.
- Human Evaluation: Explains how to set up annotation tasks and measure inter-rater agreement.
- LLM-as-Judge: Explains how to use stronger LLMs (e.g., GPT-4) to evaluate outputs from other models.
- Use Case: When developing a new LLM-powered chatbot, this Skill helps you set up an evaluation framework to compare different prompt variations, detect performance regressions, and ensure the chatbot provides accurate and helpful responses.
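As a rough illustration of two items in the list above, the sketch below computes the retrieval metrics (MRR and NDCG) and one common inter-rater agreement statistic (Cohen's kappa) in plain Python. The function names and input shapes are assumptions made for this example; they are not part of the Skill or of any particular library.

```python
import math
from collections import Counter


def mean_reciprocal_rank(ranked_relevance):
    """Mean of 1/rank of the first relevant result, averaged over queries.

    ranked_relevance: one list per query of 0/1 relevance flags in rank order.
    A query with no relevant result contributes 0.
    """
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query.

    gains: graded relevance scores in the order the system ranked them.
    """
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    print(ndcg_at_k([3, 2, 1], k=3))
    print(cohens_kappa(["good", "good", "bad", "bad"],
                       ["good", "bad", "bad", "bad"]))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over simple percent agreement when label distributions are skewed.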
Quick Start
Example: Basic LLM evaluation suite
This demonstrates defining metrics and running an evaluation on test cases.
from llm_eval import EvaluationSuite, Metric

# check_groundedness and your_model are placeholders that you supply.
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness),
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital.",
    },
    # ... more test cases
]

results = suite.evaluate(model=your_model, test_cases=test_cases)
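The example above passes a `check_groundedness` function without defining it. The sketch below shows one crude way such a function might look: it scores an answer by the fraction of its word tokens that also appear in the supplied context. Both the `(output, context)` signature and the lexical-overlap heuristic are assumptions made for illustration; the callback signature that `Metric.custom` actually expects is not documented here, and a production groundedness check would more likely use an entailment model or an LLM judge.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_groundedness(output, context):
    """Fraction of the output's tokens that also occur in the context.

    Hypothetical helper: returns a score in [0, 1], and 0.0 for empty output.
    """
    output_tokens = _tokens(output)
    if not output_tokens:
        return 0.0
    return len(output_tokens & _tokens(context)) / len(output_tokens)


if __name__ == "__main__":
    context = "France is a country in Europe. Paris is its capital."
    print(check_groundedness("Paris", context))
    print(check_groundedness("Berlin", context))
```

Token overlap rewards answers that copy the context and penalizes correct paraphrases, so treat the score as a cheap first filter rather than a final verdict.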
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: llm-evaluation
Download link: https://github.com/camoneart/claude-code/archive/main.zip#llm-evaluation
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.