llm-eval-multi-model

Community

Benchmark multiple LLMs side-by-side.

Authorsaintgo7
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This Skill solves the challenge of comparing multiple LLM models fairly by measuring latency, throughput, token usage, and response quality under identical prompts and sampling settings.

Core Features & Use Cases

  • Parallel multi-endpoint evaluation: Sends the same prompt to multiple LLM endpoints concurrently to compare performance at the same time.
  • Production-style latency and token metrics: Captures TTFT, TPOT/throughput, and prompt/completion token usage with p50/p95/p99 summaries and warm-up handling.
  • Quality scoring options: Supports ground-truth grading for deterministic tasks and optional LLM-as-judge for subjective quality comparisons, including tool-calling accuracy evaluation patterns.

Quick Start

Run the evaluation by calling the skill install script and then request the AI to compare two models with the same prompt set, collecting latency and quality metrics into a single report.

Dependency Matrix

Required Modules

httpx

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: llm-eval-multi-model
Download link: https://github.com/saintgo7/claude-skills/archive/main.zip#llm-eval-multi-model

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.