llm-eval-multi-model
Community
Benchmark multiple LLMs side by side.
Data & Analytics | #llm-as-judge, #llm evaluation, #token usage, #latency metrics, #model benchmarking, #asyncio concurrency
Author: saintgo7
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill makes it possible to compare multiple LLMs fairly: it measures latency, throughput, token usage, and response quality under identical prompts and sampling settings.
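As a rough illustration of what "identical prompts and sampling settings" can mean in practice, the sketch below builds one request payload per model from a single shared settings object. The class name `SamplingSettings`, the helper `build_requests`, and the payload keys are hypothetical and are not taken from the skill's scripts; real endpoints may expect different field names.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SamplingSettings:
    """Hypothetical shared settings; frozen so they cannot drift mid-run."""
    temperature: float = 0.0
    top_p: float = 1.0
    max_tokens: int = 256


def build_requests(
    models: List[str], prompt: str, settings: SamplingSettings
) -> List[Dict[str, object]]:
    """Build one payload per model; only the 'model' field differs."""
    shared = asdict(settings)
    # Payload keys here are an assumption, not a documented endpoint schema.
    return [{"model": name, "prompt": prompt, **shared} for name in models]


if __name__ == "__main__":
    for payload in build_requests(["model-a", "model-b"], "Hello", SamplingSettings()):
        print(payload)
```

Because every payload is derived from the same frozen object, any difference in the results can be attributed to the model rather than to the request.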
Core Features & Use Cases
- Parallel multi-endpoint evaluation: Sends the same prompt to multiple LLM endpoints concurrently, so their performance can be compared side by side under the same conditions.
- Production-style latency and token metrics: Captures time to first token (TTFT), time per output token (TPOT) and throughput, and prompt/completion token usage, with p50/p95/p99 summaries and warm-up handling.
- Quality scoring options: Supports ground-truth grading for deterministic tasks and optional LLM-as-judge for subjective quality comparisons, including tool-calling accuracy evaluation patterns.
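The concurrent fan-out and percentile summaries described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `SendFn`, `timed_call`, `benchmark`, and `summarize` are hypothetical names, and each endpoint is modeled as an injected async callable (in practice it might wrap an `httpx.AsyncClient` request). Only total request latency is measured here; TTFT and TPOT would additionally require streaming responses.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable, Dict, List

# Hypothetical type: an async function that sends one prompt to one endpoint
# and returns the response text.
SendFn = Callable[[str], Awaitable[str]]


async def timed_call(send: SendFn, prompt: str) -> float:
    """Return wall-clock latency in seconds for a single request."""
    start = time.perf_counter()
    await send(prompt)
    return time.perf_counter() - start


async def benchmark(
    endpoints: Dict[str, SendFn], prompts: List[str], warmup: int = 1
) -> Dict[str, List[float]]:
    """Send each prompt to every endpoint concurrently; drop warm-up samples."""
    latencies: Dict[str, List[float]] = {name: [] for name in endpoints}
    for i, prompt in enumerate(prompts):
        # All endpoints receive the same prompt in the same gather() call.
        results = await asyncio.gather(
            *(timed_call(send, prompt) for send in endpoints.values())
        )
        if i >= warmup:  # the first `warmup` rounds are not recorded
            for name, latency in zip(endpoints, results):
                latencies[name].append(latency)
    return latencies


def summarize(samples: List[float]) -> Dict[str, float]:
    """Compute p50/p95/p99 using the stdlib's inclusive quantile method."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    async def fake_endpoint(prompt: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a network round trip
        return "ok"

    data = asyncio.run(benchmark({"model-a": fake_endpoint}, ["p1", "p2", "p3"]))
    print({name: summarize(vals) for name, vals in data.items()})
```

Injecting the send function keeps the timing logic independent of any particular HTTP client, which also makes it straightforward to test with fake endpoints.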
Quick Start
Run the skill's install script, then ask the AI to compare two models on the same prompt set and collect the latency and quality metrics into a single report.
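To give a sense of what such a single report might contain, here is a small sketch that grades answers against ground truth by normalized exact match and combines the score with mean latency per model. The function names and the report fields are assumptions made for illustration; the skill's real report format may differ.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.strip().lower().split())


def exact_match_accuracy(answers: List[str], truths: List[str]) -> float:
    """Fraction of answers that equal the ground truth after normalization."""
    if len(answers) != len(truths):
        raise ValueError("answers and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(normalize(a) == normalize(t) for a, t in zip(answers, truths))
    return hits / len(truths)


def build_report(
    answers_by_model: Dict[str, List[str]],
    latencies_by_model: Dict[str, List[float]],
    truths: List[str],
) -> List[Dict[str, object]]:
    """One row per model (hypothetical fields), best accuracy first."""
    rows = []
    for model, answers in answers_by_model.items():
        lat = latencies_by_model[model]
        rows.append({
            "model": model,
            "accuracy": exact_match_accuracy(answers, truths),
            "mean_latency_s": sum(lat) / len(lat),
        })
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


if __name__ == "__main__":
    report = build_report(
        {"model-a": ["Paris", " 4 "], "model-b": ["paris", "5"]},
        {"model-a": [0.25, 0.75], "model-b": [0.1, 0.1]},
        ["paris", "4"],
    )
    for row in report:
        print(row)
```

Exact match only suits deterministic tasks; subjective comparisons would need the optional LLM-as-judge path instead.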
Dependency Matrix
Required Modules: httpx
Components: scripts
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: llm-eval-multi-model
Download link: https://github.com/saintgo7/claude-skills/archive/main.zip#llm-eval-multi-model
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.