troubleshoot-nvidia-gpu
Official
Diagnose GPU failures fast via Netdata + MCP
System Documentation
What problem does it solve?
It helps you triage and remediate NVIDIA GPU issues by identifying the dominant failure archetype (OOM, thermal-throttle cascade, silent degradation, interconnect straggler, or GPU hang) from Netdata signals accessed through an MCP query flow.
Core Features & Use Cases
- Structured NVIDIA GPU triage tree: Routes an agent through domain-specific rule files aligned to the signal domains of the Netdata operator playbook (availability, thermal/power, memory, compute/utilization, interconnect, errors/Xid, plus support domains).
- MCP-driven verification: Uses MCP calls to discover nodes, pull the last-window metrics, rank anomalies, correlate signals, and confirm whether remediation actually returns signals to expected ranges.
- Failure-mode coverage: Targets common real-world incidents such as memory exhaustion (OOM), thermal runaway with throttling, ECC-related silent corruption, PCIe/NVLink degradation patterns, and catastrophic “fallen off the bus” behaviors (including Xid 79).
- Operational on-call workflow: Designed for situations where an on-call engineer needs a paging-ready, evidence-based triage path tied to Netdata alerts on a specific GPU host.
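The routing idea behind the triage tree can be sketched in a few lines of Python. This is an illustrative sketch only, not the skill's actual rule files: the keyword table, the signal names, and the anomaly scores are all assumptions invented for the example, and the mapping is deliberately partial (anything unmatched falls into a generic "support" bucket).

```python
from collections import defaultdict

# Hypothetical keyword table: maps substrings of a signal name to a
# playbook domain. The real rule files may route very differently.
DOMAIN_KEYWORDS = {
    "thermal/power": ("temperature", "power", "throttle"),
    "memory": ("memory", "frame_buffer"),
    "compute/utilization": ("utilization", "clock"),
    "interconnect": ("pcie", "nvlink"),
    "errors/Xid": ("xid", "ecc"),
}


def domain_for(signal: str) -> str:
    """Return the first domain whose keywords appear in the signal name."""
    name = signal.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(k in name for k in keywords):
            return domain
    return "support"  # fallback bucket for unmatched signals


def rank_domains(anomaly_scores: dict) -> list:
    """Sum anomaly scores per domain and sort domains, highest first."""
    totals = defaultdict(float)
    for signal, score in anomaly_scores.items():
        totals[domain_for(signal)] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores standing in for an anomaly ranking over the last window.
scores = {
    "nvidia_smi.gpu_temperature": 0.8,
    "nvidia_smi.gpu_power_draw": 0.5,
    "nvidia_smi.gpu_frame_buffer_memory_usage": 0.4,
    "nvidia_smi.gpu_utilization": 0.1,
}
ranked = rank_domains(scores)
print(ranked[0][0])  # domain to investigate first
```

Summing scores per domain is one simple way to pick a dominant archetype; a real triage flow would likely also correlate signals in time before committing to a domain.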
Quick Start
Use the troubleshoot-nvidia-gpu skill to diagnose a failing NVIDIA GPU: query the Netdata instance through MCP for the last 15 to 30 minutes of nvidia_smi signals, match the anomalies to the appropriate rule-file domain, and run the verification queries to confirm that remediation was effective.
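As a rough illustration of the two ends of that flow, the sketch below computes a query window and checks whether a signal has settled back into an expected range after remediation. It does not call Netdata or MCP; the function names, the 30 to 85 range, the settle count, and the sample values are all hypothetical choices made for the example.

```python
import time
from typing import Optional, Sequence, Tuple


def query_window(minutes: int = 30, now: Optional[float] = None) -> Tuple[int, int]:
    """Return (after, before) epoch seconds covering the last `minutes`."""
    before = int(now if now is not None else time.time())
    return before - minutes * 60, before


def back_in_range(samples: Sequence[float], low: float, high: float,
                  settle: int = 5) -> bool:
    """True if the most recent `settle` samples all lie within [low, high]."""
    recent = samples[-settle:]
    return len(recent) == settle and all(low <= v <= high for v in recent)


# Fixed `now` so the example is reproducible.
after, before = query_window(15, now=1_700_000_000)
print(before - after)  # window length in seconds

# Made-up temperature readings: hot at first, cooling after remediation.
temps = [91, 93, 92, 88, 80, 74, 71, 70, 69, 68]
print(back_in_range(temps, low=30, high=85))
```

Requiring several consecutive in-range samples, rather than just the latest one, avoids declaring success on a single lucky reading.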
Dependency Matrix
Required Modules
None required
Components
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: troubleshoot-nvidia-gpu
Download link: https://github.com/netdata/skills/archive/main.zip#troubleshoot-nvidia-gpu
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.