k8s-nvidia-node-troubleshooting

Community

Diagnose GPU node health and driver exposure.

AuthorHugoAlmeidaMoreira
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Kubernetes clusters can experience GPU workload issues when a node's NVIDIA driver, GPU exposure, or device plugin health is misaligned. This Skill helps operators determine whether NVIDIA GPUs are being exposed correctly and whether the driver and device plugin are functioning, enabling faster diagnosis.

Core Features & Use Cases

  • Node readiness verification: confirm the target node is Ready and has GPU capacity available.
  • GPU exposure validation: verify that GPUs are visible to the scheduler and that the NVIDIA device plugin is registered.
  • Guided troubleshooting workflow: provide steps to read driver versions and check device-plugin health across cluster nodes.

Quick Start

Identify whether a Kubernetes node is Ready, GPUs are exposed, and the NVIDIA device plugin is healthy on the target node.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: k8s-nvidia-node-troubleshooting
Download link: https://github.com/HugoAlmeidaMoreira/zeus-agent/archive/main.zip#k8s-nvidia-node-troubleshooting

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.