gstack-benchmark-models
Community
Pick the best model with real data.
Category: Data & Analytics
Tags: #latency #token usage #cost #quality scoring #prompt evaluation #model benchmarking #LLM providers
Author: anilveersingh1308
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill removes guesswork when choosing which AI model is best for a given task by running the same prompt across providers and comparing measurable outcomes like latency and cost.
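The idea of running one prompt through several providers and comparing latency and cost can be sketched as follows. This is a minimal illustration, not the Skill's actual implementation: the provider names, the stub functions, and the per-token prices are all hypothetical placeholders, and a real run would call each provider's own client instead of the stubs.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical prices in USD per 1,000 tokens (NOT real provider pricing).
PRICE_PER_1K_TOKENS: Dict[str, float] = {"model-a": 0.003, "model-b": 0.010}

# A provider is any callable that takes a prompt and returns (output, tokens used).
Provider = Callable[[str], Tuple[str, int]]


@dataclass
class RunResult:
    provider: str
    latency_s: float
    tokens: int
    cost_usd: float
    output: str


def run_shootout(prompt: str, providers: Dict[str, Provider]) -> List[RunResult]:
    """Send the same prompt to every provider and record latency, tokens, cost."""
    results = []
    for name, call in providers.items():
        start = time.perf_counter()
        output, tokens = call(prompt)
        latency = time.perf_counter() - start
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[name]
        results.append(RunResult(name, latency, tokens, cost, output))
    # Cheapest first; sort by latency_s instead if speed matters more.
    return sorted(results, key=lambda r: r.cost_usd)


# Stub providers standing in for real API or CLI calls (assumption).
def fake_model_a(prompt: str) -> Tuple[str, int]:
    return ("answer from A", 200)


def fake_model_b(prompt: str) -> Tuple[str, int]:
    return ("answer from B", 150)


results = run_shootout(
    "Summarize this changelog.",
    {"model-a": fake_model_a, "model-b": fake_model_b},
)
for r in results:
    print(f"{r.provider}: {r.latency_s:.4f}s, {r.tokens} tokens, ${r.cost_usd:.6f}")
```

Keeping each provider behind the same callable shape means every model sees an identical input, so the recorded differences reflect the models rather than the harness.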
Core Features & Use Cases
- Cross-model prompt shootouts: Executes the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side to compare performance.
- Actionable comparisons: Reports latency, token usage, and cost, and can optionally add an LLM judge score for output quality.
- Safe preflight workflow: Forces an auth-aware dry run first so you can see which providers are ready before any paid benchmark run.
- Repeatable baselines: Can save benchmark results as JSON so you can compare future runs and detect performance drift.
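Comparing a new run against a saved JSON baseline might look like the sketch below. The JSON layout (provider name mapped to a dictionary of metrics), the function names, and the 20% tolerance are assumptions made for illustration; the file format the Skill actually writes may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Metrics = Dict[str, Dict[str, float]]  # provider -> metric name -> value (assumed layout)


def save_baseline(path: Path, results: Metrics) -> None:
    """Write benchmark metrics to disk as JSON."""
    path.write_text(json.dumps(results, indent=2))


def load_baseline(path: Path) -> Metrics:
    """Read previously saved benchmark metrics."""
    return json.loads(path.read_text())


def detect_drift(
    baseline: Metrics, current: Metrics, tolerance: float = 0.20
) -> List[Tuple[str, str, float]]:
    """Return (provider, metric, relative change) for metrics that got worse
    by more than `tolerance`. Higher latency or cost counts as worse."""
    drifted = []
    for provider, metrics in current.items():
        old = baseline.get(provider)
        if old is None:
            continue  # new provider, nothing to compare against
        for metric, value in metrics.items():
            old_value = old.get(metric)
            if not old_value:
                continue  # missing or zero baseline, skip to avoid dividing by zero
            change = (value - old_value) / old_value
            if change > tolerance:
                drifted.append((provider, metric, round(change, 3)))
    return drifted


# Illustrative numbers only, not measured results.
baseline_run: Metrics = {"model-a": {"latency_s": 1.0, "cost_usd": 0.002}}
current_run: Metrics = {"model-a": {"latency_s": 1.5, "cost_usd": 0.002}}

with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_baseline(baseline_path, baseline_run)
    drift = detect_drift(load_baseline(baseline_path), current_run)

print(drift)
```

In this example latency rose by 50% while cost stayed flat, so only the latency metric is reported as drift.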
Quick Start
Run gstack-benchmark-models and choose a prompt (or select a benchmark-able gstack skill) to compare Claude, GPT, and Gemini on the same input.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: gstack-benchmark-models
Download link: https://github.com/anilveersingh1308/copilot-skills/archive/main.zip#gstack-benchmark-models
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.