troubleshoot-nvidia-dcgm
Official
Diagnose NVIDIA DCGM failures fast
Author: netdata
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
It helps you troubleshoot NVIDIA DCGM issues by turning unclear symptoms (errors, throttling, hangs, or silent slowdowns) into a structured path from triage to a recommended remediation, based on signals observed by Netdata.
Core Features & Use Cases
- Failure-mode triage tree for NVIDIA DCGM: Guides investigation across HBM degradation, thermal throttling/runaway, NVLink errors, GPU hangs (XID), silent performance degradation, and power capping patterns.
- MCP-driven signal verification via Netdata: Queries Netdata with MCP to confirm health signals in specific dcgm.* contexts over a recent time window and ranks anomalies first.
- Playbook-aligned remediation loop: Recommends the smallest remediation hinted by the operator playbook, then verifies that signals return to expected ranges.
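The triage and verification loop described in the features above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the context names (`dcgm.example_*`), the threshold values, and the `rank_anomalies` and `recovered` helpers are all hypothetical, and readings are passed in as plain data instead of being fetched through Netdata MCP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    failure_mode: str  # one of the failure modes named in the feature list
    context: str       # hypothetical name, NOT a real Netdata context
    limit: float       # hypothetical threshold; real limits come from the playbook


# Placeholder checks for illustration only.
CHECKS = [
    Check("thermal throttling", "dcgm.example_temperature", 85.0),
    Check("NVLink errors", "dcgm.example_nvlink_errors", 0.0),
    Check("GPU hang (XID)", "dcgm.example_xid_events", 0.0),
]


def rank_anomalies(readings, checks=CHECKS):
    """Return (failure_mode, excess) pairs for readings above their limit,
    largest excess first. `readings` maps a context name to its latest value."""
    found = []
    for check in checks:
        value = readings.get(check.context)
        if value is not None and value > check.limit:
            found.append((check.failure_mode, value - check.limit))
    return sorted(found, key=lambda item: item[1], reverse=True)


def recovered(readings, checks=CHECKS):
    """Verification step: True when no checked signal exceeds its limit."""
    return not rank_anomalies(readings, checks)
```

Ranking by raw excess over the limit is a deliberately crude choice, since the signals have different units; a real triage would likely normalize each signal or use Netdata's own anomaly information before ordering them.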
Quick Start
Ask an AI agent to troubleshoot NVIDIA DCGM health using Netdata MCP by checking for HBM degradation, thermal throttling, NVLink errors, XID-related hangs, and silent performance degradation, then provide the most likely remediation and the verification queries to confirm recovery.
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: troubleshoot-nvidia-dcgm
Download link: https://github.com/netdata/skills/archive/main.zip#troubleshoot-nvidia-dcgm
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
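For readers who prefer to see what the extract-and-install step amounts to, here is a minimal sketch that operates on an archive already downloaded to disk. It assumes, without confirmation from the source, that the archive contains a directory named after the skill somewhere in its tree; the `install_skill` helper and its parameters are hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath


def install_skill(archive, skill_name, dest_root):
    """Copy every file found under a directory named `skill_name` inside
    `archive` into `dest_root/skill_name`. Returns the number of files written.
    Assumption: the archive holds such a directory somewhere in its tree."""
    target = Path(dest_root) / skill_name
    written = 0
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            parts = PurePosixPath(info.filename).parts
            if info.is_dir() or skill_name not in parts:
                continue
            # Keep only the path below the skill directory.
            rel = parts[parts.index(skill_name) + 1:]
            if not rel or ".." in rel:
                continue  # skip odd entries and path-traversal attempts
            out = target.joinpath(*rel)
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(zf.read(info))
            written += 1
    return written
```

A call such as `install_skill("main.zip", "troubleshoot-nvidia-dcgm", ".claude/skills")` would then place the skill's files under `.claude/skills/troubleshoot-nvidia-dcgm/`, provided the layout assumption holds.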