perf-host-analysis

Community

Pinpoint TensorRT-LLM host bottlenecks fast

Author: yo-steven
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill determines whether host/CPU overhead is the bottleneck in TensorRT-LLM inference by analyzing nsys traces, then isolates the specific host-side operations that regressed.

Core Features & Use Cases

  • Bottleneck detection (YES/NO): Combines GPU idle ratio, CUDA launch overhead ratio, and GPU utilization, applies an NCCL caveat, and optionally adds host-prep exposed, impact, and attribution metrics to confirm whether host overhead is actually harming throughput.
  • Root-cause analysis: Compares baseline vs target traces to quantify per-step wall-time regressions and produce an NVTX per-operation breakdown for request-management and scheduling overhead.
  • Phase-aware diagnosis: Separates context (eager) from generation (graph replay) behavior, so a regression that is hidden in the aggregate numbers still surfaces.
  • Optional kernel-level drill-down: When NVTX indicates time inside GPU dispatch, further analyzes inter-kernel gaps, eager-vs-graph coverage, repeating patterns, and multi-rank straggler behavior.
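To make the GPU idle ratio mentioned above concrete, here is a minimal sketch of one way such a metric could be computed from kernel execution intervals. It is not the skill's actual implementation: the function names `merge_intervals` and `gpu_idle_ratio`, and the choice to measure over the span from the first kernel start to the last kernel end, are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start_ns, end_ns) of one GPU kernel


def merge_intervals(intervals: Iterable[Interval]) -> list[Interval]:
    """Merge overlapping or touching intervals so busy time is not double counted."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def gpu_idle_ratio(kernels: Iterable[Interval]) -> float:
    """Fraction of the traced window during which no kernel was running.

    The window is assumed to run from the first kernel start to the last
    kernel end. Returns 0.0 when there are no kernels or the window is empty.
    """
    merged = merge_intervals(kernels)
    if not merged:
        return 0.0
    window = merged[-1][1] - merged[0][0]
    if window <= 0:
        return 0.0
    busy = sum(end - start for start, end in merged)
    return 1.0 - busy / window
```

A high idle ratio on its own only shows that the GPU was waiting. Whether the host is the cause still depends on the other signals listed above, such as launch overhead and the NCCL caveat.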

Quick Start

Use the perf-host-analysis skill to analyze a single nsys SQLite trace file and produce a host-overhead verdict with metric evidence.
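For orientation, the sketch below shows roughly what reading GPU kernel intervals out of an nsys SQLite export can look like. It assumes the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start` and `end` columns in nanoseconds, which is typical of nsys exports but is not confirmed by this listing. The helper name `load_kernel_intervals` is hypothetical and is not part of the skill.

```python
import sqlite3

# Assumed table name: nsys SQLite exports usually store GPU kernel records here.
KERNEL_TABLE = "CUPTI_ACTIVITY_KIND_KERNEL"


def load_kernel_intervals(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    """Return (start_ns, end_ns) for every recorded kernel, ordered by start.

    Returns an empty list when the trace has no kernel table, for example
    when no GPU work was captured.
    """
    has_table = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (KERNEL_TABLE,),
    ).fetchone()
    if has_table is None:
        return []
    # Column names are quoted because "end" is also an SQL keyword.
    rows = conn.execute(
        f'SELECT "start", "end" FROM {KERNEL_TABLE} ORDER BY "start"'
    ).fetchall()
    return [(int(start), int(end)) for start, end in rows]
```

In practice the connection would come from something like `sqlite3.connect("trace.sqlite")`, where the file name is a placeholder for your own exported trace.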

Dependency Matrix

Required Modules

None required

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: perf-host-analysis
Download link: https://github.com/yo-steven/skills-exploration-20260522/archive/main.zip#perf-host-analysis

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
