troubleshoot-nvidia-gpu

Official

Diagnose GPU failures fast via Netdata + MCP

Authornetdata
Version1.0.0
Installs0

System Documentation

What problem does it solve?

It helps you triage and remediate Nvidia GPU issues by identifying the dominant failure archetype (OOM, thermal-throttle cascade, silent degradation, interconnect straggler, or GPU hang) using Netdata signals accessed through an MCP query flow.

Core Features & Use Cases

  • Structured Nvidia Gpu triage tree: Routes an agent through domain-specific rule files aligned to the Netdata operator playbook’s signal domains (availability, thermal/power, memory, compute/utilization, interconnect, errors/Xid, plus support domains).
  • MCP-driven verification: Uses MCP calls to discover nodes, pull the last-window metrics, rank anomalies, correlate signals, and confirm whether remediation actually returns signals to expected ranges.
  • Failure-mode coverage: Targets common real-world incidents like memory exhaustion (OOM), thermal runaway with throttling, ECC-related silent corruption, PCIe/NVLink degradation patterns, and catastrophic “fallen off bus” behaviors (including XID 79).
  • Operational on-call workflow: Designed for situations where an on-call engineer needs a paging-ready, evidence-based triage path tied to Netdata alerts on a specific GPU host.

Quick Start

Use the troubleshoot-nvidia-gpu skill to diagnose a failing Nvidia GPU by querying the Netdata instance through MCP for the last 15 to 30 minutes of nvidia_smi signals, matching anomalies to the appropriate rule-file domain, and running the verification queries to confirm remediation effectiveness.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: troubleshoot-nvidia-gpu
Download link: https://github.com/netdata/skills/archive/main.zip#troubleshoot-nvidia-gpu

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.