Name: llm-eval-multi-model
Author: saintgo7

System Documentation

What problem does it solve?

This Skill addresses the challenge of comparing multiple LLMs fairly: it measures latency, throughput, token usage, and response quality under identical prompts and sampling settings.

Core Features & Use Cases

Parallel multi-endpoint evaluation: Sends the same prompt to multiple LLM endpoints concurrently, so that all models are measured over the same time window.
Production-style latency and token metrics: Captures time to first token (TTFT), time per output token (TPOT) and throughput, and prompt/completion token usage, with p50/p95/p99 summaries and warm-up handling.
Quality scoring options: Supports ground-truth grading for deterministic tasks and an optional LLM-as-judge for subjective quality comparisons, and includes evaluation patterns for tool-calling accuracy.
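
To make the latency metrics above concrete, here is a minimal sketch of how TTFT, TPOT, and p50/p95/p99 summaries with warm-up handling could be computed from per-token arrival timestamps. It is not the Skill's own code: `RequestTiming`, `ttft`, `tpot`, `percentile`, and `summarize` are hypothetical names, and the nearest-rank percentile rule is an assumption (the Skill may use a different method).

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each completion token


def ttft(t: RequestTiming) -> float:
    """Time to first token: first token arrival minus send time."""
    return t.token_times[0] - t.sent_at


def tpot(t: RequestTiming) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed choice of method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings: list[RequestTiming], warmup: int = 1) -> dict[str, float]:
    """Drop the first `warmup` requests, then report p50/p95/p99."""
    measured = timings[warmup:]
    if not measured:
        raise ValueError("no requests left after discarding warm-up runs")
    summary = {}
    for name, metric in (("ttft", ttft), ("tpot", tpot)):
        samples = [metric(t) for t in measured]
        for p in (50, 95, 99):
            summary[f"{name}_p{p}"] = percentile(samples, p)
    return summary
```

Discarding the first few requests keeps cold-start effects (connection setup, cache misses) from skewing the tail percentiles; how many to discard depends on the endpoint.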

Quick Start

Install the Skill, then ask the AI to compare two models on the same prompt set, collecting latency and quality metrics into a single report.

Please help me install this Skill:
Name: llm-eval-multi-model
Download link: https://github.com/saintgo7/claude-skills/archive/main.zip#llm-eval-multi-model
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
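
As an illustration of the kind of comparison described in this Quick Start, the sketch below sends the same prompt set to two models concurrently and collects latency and ground-truth accuracy into one report. It does not call the Skill or any real endpoint: `model_a`, `model_b`, `timed_call`, `compare`, and the sample cases are hypothetical stand-ins, and exact string match is an assumed grading rule.

```python
import asyncio
import time

# Hypothetical prompt set with ground-truth answers.
CASES = [("2+2?", "4"), ("Capital of France?", "Paris")]


async def model_a(prompt: str) -> str:
    """Stub endpoint (assumption): answers both cases correctly."""
    await asyncio.sleep(0.01)
    return {"2+2?": "4", "Capital of France?": "Paris"}[prompt]


async def model_b(prompt: str) -> str:
    """Stub endpoint (assumption): slower, and wrong on one case."""
    await asyncio.sleep(0.02)
    return {"2+2?": "4", "Capital of France?": "Lyon"}[prompt]


MODELS = {"model-a": model_a, "model-b": model_b}


async def timed_call(name, call, prompt):
    """Call one model once and record its wall-clock latency."""
    start = time.perf_counter()
    answer = await call(prompt)
    return {"model": name, "prompt": prompt, "answer": answer,
            "latency_s": time.perf_counter() - start}


async def compare(models, cases):
    """Fan every prompt out to every model concurrently, then grade."""
    tasks = [timed_call(name, call, prompt)
             for prompt, _ in cases
             for name, call in models.items()]
    results = await asyncio.gather(*tasks)
    expected = dict(cases)
    report = {}
    for r in results:
        entry = report.setdefault(
            r["model"], {"correct": 0, "total": 0, "latencies_s": []})
        entry["total"] += 1
        # Ground-truth grading by exact match after trimming whitespace.
        entry["correct"] += int(r["answer"].strip() == expected[r["prompt"]])
        entry["latencies_s"].append(r["latency_s"])
    return report


report = asyncio.run(compare(MODELS, CASES))
for name, entry in report.items():
    print(name, f"{entry['correct']}/{entry['total']} correct")
```

Using `asyncio.gather` means all requests are in flight together, so each model sees the same time window; exact-match grading only suits deterministic tasks, which is where an LLM-as-judge would take over for subjective ones.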
