troubleshoot-nvidia-dcgm

Official

Diagnose NVIDIA DCGM failures fast

Author: netdata
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

It helps you troubleshoot NVIDIA DCGM issues by turning unclear symptoms (errors, throttling, hangs, or silent slowdowns) into a structured path of triage and recommended remediation, based on signals observed by Netdata.

Core Features & Use Cases

  • Failure-mode triage tree for NVIDIA DCGM: Guides investigation across HBM degradation, thermal throttling or runaway, NVLink errors, GPU hangs (XID), silent performance degradation, and power-capping patterns.
  • MCP-driven signal verification via Netdata: Queries Netdata over MCP to confirm health signals in specific dcgm.* contexts over a recent time window, ranking anomalies first.
  • Playbook-aligned remediation loop: Recommends the smallest remediation that the operator playbook suggests, then verifies that signals return to their expected ranges.
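The "ranks anomalies first" step above can be illustrated with a minimal sketch. It does not call Netdata or MCP; it assumes the samples for each context have already been fetched into memory. The context names, the sample values, and the z-score heuristic are all assumptions made for illustration and are not taken from the skill itself.

```python
from statistics import mean, pstdev


def anomaly_score(baseline, recent):
    """Absolute z-score of the recent mean against the baseline window."""
    spread = pstdev(baseline)
    if spread == 0:
        # Flat baseline: any shift at all is treated as maximally anomalous.
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread


def rank_contexts(windows):
    """Return (context, score) pairs, most anomalous first.

    `windows` maps a context name to a (baseline_samples, recent_samples) pair.
    """
    scored = [(name, anomaly_score(b, r)) for name, (b, r) in windows.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical context names and made-up sample values, for illustration only.
windows = {
    "dcgm.example_temperature": ([60, 61, 59, 60], [84, 86, 85]),
    "dcgm.example_power": ([250, 252, 248, 250], [251, 249, 250]),
}

ranked = rank_contexts(windows)
for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

Ranking by deviation from a recent baseline is only one possible heuristic; the point is that the most suspicious context is inspected first, before walking the rest of the triage tree.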

Quick Start

Ask an AI agent to troubleshoot NVIDIA DCGM health through Netdata MCP: have it check for HBM degradation, thermal throttling, NVLink errors, XID-related hangs, and silent performance degradation, then report the most likely remediation along with the verification queries that confirm recovery.
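The "confirm recovery" step can be sketched as a range check over samples collected after a remediation. This is a hedged illustration only: the context names, the expected ranges, and the helper functions `recovered` and `verify_recovery` are hypothetical, and the samples are assumed to have been retrieved already rather than queried live.

```python
def recovered(samples, low, high, min_fraction=1.0):
    """True if at least `min_fraction` of samples fall within [low, high]."""
    if not samples:
        # No data after the fix is treated as "not verified", not as healthy.
        return False
    in_range = sum(low <= s <= high for s in samples)
    return in_range / len(samples) >= min_fraction


def verify_recovery(post_fix, expected):
    """Map each expected context to whether its post-fix samples look healthy."""
    return {
        name: recovered(post_fix.get(name, []), low, high)
        for name, (low, high) in expected.items()
    }


# Hypothetical contexts, ranges, and samples, for illustration only.
expected = {
    "dcgm.example_temperature": (30, 80),
    "dcgm.example_power": (100, 300),
}
post_fix = {
    "dcgm.example_temperature": [62, 64, 63],
    "dcgm.example_power": [250, 320, 255],
}

status = verify_recovery(post_fix, expected)
print(status)
```

Treating a missing or empty window as unverified keeps the loop conservative: a remediation counts as successful only when there is data showing the signal back in range.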

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: troubleshoot-nvidia-dcgm
Download link: https://github.com/netdata/skills/archive/main.zip#troubleshoot-nvidia-dcgm

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
