perf-host-analysis

Name: perf-host-analysis
Availability: InStock
Author: yo-steven

Community

Pinpoint TensorRT-LLM host bottlenecks fast

Data & Analytics #profiling #trace-analysis #performance-regression #nsys #host-overhead #tensorRT-LLM #NVTX

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill determines whether host/CPU overhead is the bottleneck in TensorRT-LLM inference by analyzing nsys traces, then isolates the specific host-side operations that regressed.

Core Features & Use Cases

Bottleneck detection (YES/NO): Combines GPU idle ratio, CUDA launch overhead ratio, GPU utilization, an NCCL caveat, and (optionally) host-prep exposed/impact/attribution metrics to confirm whether host overhead is actually limiting throughput.
Root-cause analysis: Compares baseline vs target traces to quantify per-step wall-time regressions and produce an NVTX per-operation breakdown for request-management and scheduling overhead.
Phase-aware diagnosis: Separates context (eager) from generation (graph replay) behavior, so a regression that is hidden in aggregate numbers still shows up in the phase where it occurs.
Optional kernel-level drill-down: When NVTX indicates time inside GPU dispatch, further analyzes inter-kernel gaps, eager-vs-graph coverage, repeating patterns, and multi-rank straggler behavior.
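The baseline-vs-target comparison described under root-cause analysis can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes the NVTX ranges have already been extracted as (name, duration in ns) pairs and that the number of steps in each trace is known. The function names and the range names used in the usage note (`schedule`, `prepare_inputs`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (NVTX range name, duration in nanoseconds).
NvtxRange = Tuple[str, int]


def total_by_operation(ranges: Iterable[NvtxRange]) -> Dict[str, int]:
    """Sum the duration of every NVTX range that shares a name."""
    totals: Dict[str, int] = defaultdict(int)
    for name, duration_ns in ranges:
        totals[name] += duration_ns
    return dict(totals)


def per_step_regressions(
    baseline: Iterable[NvtxRange],
    target: Iterable[NvtxRange],
    baseline_steps: int,
    target_steps: int,
) -> List[Tuple[str, float, float, float]]:
    """Return (name, baseline ns/step, target ns/step, delta) rows.

    Rows are sorted with the largest regression first, so the operation
    that grew the most per step appears at the top.
    """
    if baseline_steps <= 0 or target_steps <= 0:
        raise ValueError("step counts must be positive")
    base = total_by_operation(baseline)
    tgt = total_by_operation(target)
    rows = []
    for name in set(base) | set(tgt):
        base_per_step = base.get(name, 0) / baseline_steps
        tgt_per_step = tgt.get(name, 0) / target_steps
        rows.append((name, base_per_step, tgt_per_step, tgt_per_step - base_per_step))
    # Largest positive delta first; ties broken by name for stable output.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows
```

Normalizing by step count keeps the comparison meaningful when the two traces cover a different number of iterations. Calling `per_step_regressions(baseline_ranges, target_ranges, 2, 2)` on two short traces would list an operation such as `schedule` first if its per-step time grew the most.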

Quick Start

Use the perf-host-analysis skill to analyze a single nsys SQLite trace file and produce a host-overhead verdict with metric evidence.
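To give a sense of the kind of metric evidence involved, the sketch below estimates a GPU idle ratio from a single trace. It is an assumption-laden illustration rather than the skill's own code: it assumes the SQLite export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table with `start` and `end` timestamps in nanoseconds, treats all streams and devices as one timeline, and uses the hypothetical helper names `load_kernel_intervals` and `summarize_trace`. No YES/NO threshold is applied, since the listing does not state one.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple


def load_kernel_intervals(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Read (start, end) kernel timestamps, ordered by start time.

    Assumes a CUPTI_ACTIVITY_KIND_KERNEL table with `start` and `end` columns.
    `end` is quoted because END is an SQL keyword.
    """
    rows = conn.execute(
        'SELECT start, "end" FROM CUPTI_ACTIVITY_KIND_KERNEL ORDER BY start'
    ).fetchall()
    return [(int(start), int(end)) for start, end in rows]


def summarize_trace(conn: sqlite3.Connection) -> Dict[str, Optional[float]]:
    """Compute the traced span, GPU busy time, and idle ratio for one trace."""
    intervals = load_kernel_intervals(conn)
    if not intervals:
        return {"kernels": 0, "span_ns": 0, "busy_ns": 0, "idle_ratio": None}

    # Sweep the sorted intervals, merging overlaps so that concurrent
    # kernels are not counted twice toward busy time.
    busy_ns = 0
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start <= cur_end:
            cur_end = max(cur_end, end)
        else:
            busy_ns += cur_end - cur_start
            cur_start, cur_end = start, end
    busy_ns += cur_end - cur_start

    span_ns = cur_end - intervals[0][0]
    idle_ratio = 1.0 - busy_ns / span_ns if span_ns > 0 else None
    return {
        "kernels": len(intervals),
        "span_ns": span_ns,
        "busy_ns": busy_ns,
        "idle_ratio": idle_ratio,
    }
```

A typical call would open the exported file with `sqlite3.connect(path_to_trace)` and pass the connection to `summarize_trace`. A high idle ratio alone does not prove a host bottleneck, which is why the skill pairs it with the other metrics listed above.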
