perf-host-analysis
Community
Pinpoint TensorRT-LLM host bottlenecks fast
Data & Analytics
#profiling #trace-analysis #performance-regression #nsys #host-overhead #tensorRT-LLM #NVTX
Author: yo-steven
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill determines whether host/CPU overhead is the bottleneck in TensorRT-LLM inference by analyzing nsys traces, then isolates the specific host-side operations that regressed.
Core Features & Use Cases
- Bottleneck detection (YES/NO): Uses GPU idle ratio, CUDA launch overhead ratio, GPU utilization, NCCL caveat, and (optionally) host-prep exposed/impact/attribution to confirm whether host overhead is actually harming throughput.
- Root-cause analysis: Compares baseline vs target traces to quantify per-step wall-time regressions and produce an NVTX per-operation breakdown for request-management and scheduling overhead.
- Phase-aware diagnosis: Separates context (eager) vs generation (graph replay) behavior so a regression masked in aggregate can be surfaced reliably.
- Optional kernel-level drill-down: When NVTX indicates time inside GPU dispatch, further analyzes inter-kernel gaps, eager-vs-graph coverage, repeating patterns, and multi-rank straggler behavior.
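The baseline-vs-target comparison described above can be illustrated with a minimal sketch. It is not the Skill's own implementation: it assumes the NVTX range durations have already been extracted from each trace into plain dictionaries, and the operation names (`schedule`, `prepare_inputs`) are hypothetical placeholders.

```python
from statistics import mean


def nvtx_regressions(baseline, target):
    """Rank operations by the increase in mean duration between two traces.

    Both arguments map an NVTX range name to a list of durations in
    nanoseconds. An operation missing from one trace is treated as 0 ns
    there, so newly added operations show up as regressions.
    """
    rows = []
    for name in sorted(set(baseline) | set(target)):
        base = mean(baseline[name]) if baseline.get(name) else 0.0
        targ = mean(target[name]) if target.get(name) else 0.0
        rows.append((name, base, targ, targ - base))
    # Largest slowdown first.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical per-step durations (ns) for two host-side operations.
    baseline = {"schedule": [100, 100], "prepare_inputs": [200]}
    target = {"schedule": [150, 150], "prepare_inputs": [210]}
    for name, base, targ, delta in nvtx_regressions(baseline, target):
        print(f"{name}: {base} ns -> {targ} ns (delta {delta} ns)")
```

Sorting by absolute delta rather than percentage keeps the focus on the operations that add the most wall time per step. Running the comparison separately on context-phase and generation-phase ranges would follow the phase-aware split described above.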
Quick Start
Use the perf-host-analysis skill to analyze a single nsys SQLite trace file and produce a host-overhead verdict with metric evidence.
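To give a sense of the kind of metric behind such a verdict, here is a rough sketch of a GPU idle ratio computed from a single trace. It is an illustration under stated assumptions, not the Skill's actual method: it assumes the SQLite file contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start` and `end` columns in nanoseconds, it pools all devices and streams together, and it ignores memory copies.

```python
import sqlite3


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into disjoint busy spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def gpu_idle_ratio(intervals):
    """Fraction of the first-to-last-kernel window with no kernel running.

    Returns None when there are no kernels or the window is empty.
    """
    merged = merge_intervals(intervals)
    if not merged:
        return None
    window = merged[-1][1] - merged[0][0]
    if window <= 0:
        return None
    busy = sum(end - start for start, end in merged)
    return 1.0 - busy / window


def load_kernel_intervals(conn):
    """Read kernel intervals from an open sqlite3 connection.

    Assumption: the table and column names match the trace's schema;
    adjust them if the export differs.
    """
    query = 'SELECT "start", "end" FROM CUPTI_ACTIVITY_KIND_KERNEL'
    return [(start, end) for start, end in conn.execute(query)]


if __name__ == "__main__":
    conn = sqlite3.connect("trace.sqlite")  # hypothetical file name
    print(gpu_idle_ratio(load_kernel_intervals(conn)))
    conn.close()
```

A high idle ratio alone does not prove a host bottleneck, since the GPU can also sit idle while waiting on communication, which is why the Skill combines several metrics before giving a verdict.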
Dependency Matrix
Required Modules
None required
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: perf-host-analysis
Download link: https://github.com/yo-steven/skills-exploration-20260522/archive/main.zip#perf-host-analysis
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.