llm-evaluation

Community

Master LLM evaluation for accurate, reliable AI apps.

Author: camoneart
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Evaluating Large Language Models (LLMs) and the applications built on them is essential for performance, reliability, and safety, but doing it systematically is hard. This Skill is a guide to three approaches: automated metrics, human evaluation, and LLM-as-Judge techniques.

Core Features & Use Cases

  • Automated Metrics: Covers BLEU, ROUGE, and BERTScore for text generation, and MRR and NDCG for retrieval.
  • Human Evaluation: Explains how to set up annotation tasks and measure inter-rater agreement.
  • LLM-as-Judge: Explains how to use stronger LLMs (e.g., GPT-4) to evaluate outputs from other models.
  • Use Case: When developing a new LLM-powered chatbot, this Skill helps you set up an evaluation framework to compare different prompt variations, detect performance regressions, and ensure the chatbot provides accurate and helpful responses.
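
To make two of the quantities named above concrete, here is a minimal sketch of the retrieval metrics (MRR, NDCG) and one common inter-rater agreement statistic (Cohen's kappa) in plain Python. The function names are illustrative and are not taken from this Skill. The NDCG variant shown uses linear gain, and Cohen's kappa is only one of several agreement measures the Skill might cover.

```python
import math
from collections import Counter


def mean_reciprocal_rank(ranked_relevance):
    """MRR over queries; each item is a list of 0/1 relevance flags in rank order."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains, k):
    """NDCG@k for one query; gains are graded relevance scores in rank order."""
    def dcg(values):
        # Linear gain with a log2 position discount (rank 1 has discount 1).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
ndcg = ndcg_at_k([0, 1], k=2)
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

For production use, scikit-learn (listed among the required modules) provides `sklearn.metrics.ndcg_score` and `sklearn.metrics.cohen_kappa_score`; the hand-written versions here are only meant to show what is being computed.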

Quick Start

Example: Basic LLM evaluation suite

This demonstrates defining metrics and running an evaluation on test cases.

from llm_eval import EvaluationSuite, Metric

# check_groundedness and your_model must be defined by the caller.
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom(name="groundedness", fn=check_groundedness),
])

test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital.",
    },
    # ... more test cases
]

results = suite.evaluate(model=your_model, test_cases=test_cases)
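
The Quick Start passes a `check_groundedness` function to `Metric.custom` without showing it. As a rough sketch of what such a function could look like, the version below scores an answer by the fraction of its tokens that also appear in the supplied context. The `(output, context) -> float` signature is an assumption, since the callback contract expected by `Metric.custom` is not documented here, and token overlap is a deliberately crude stand-in for a real groundedness check.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens as a set (hypothetical helper)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_groundedness(output, context):
    """Assumed signature: return the share of output tokens found in the context."""
    out_tokens = _tokens(output)
    if not out_tokens:
        return 0.0  # an empty answer is treated as ungrounded
    return len(out_tokens & _tokens(context)) / len(out_tokens)


context = "France is a country in Europe. Paris is its capital."
grounded = check_groundedness("Paris", context)
partly_grounded = check_groundedness("Berlin is the capital", context)
```

A score near 1.0 means most of the answer's wording is supported by the context. Because the check is lexical, it will miss paraphrases and can reward answers that reuse context words incorrectly, which is why entailment-based or LLM-as-Judge checks are often preferred.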

Dependency Matrix

Required Modules

  • nltk
  • rouge-score
  • bert-score
  • transformers
  • detoxify
  • scipy
  • numpy
  • scikit-learn

Components

  • assets
  • references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: llm-evaluation
Download link: https://github.com/camoneart/claude-code/archive/main.zip#llm-evaluation

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.