k8s-nvidia-node-troubleshooting
CommunityDiagnose GPU node health and driver exposure.
AuthorHugoAlmeidaMoreira
Version1.0.0
Installs0
System Documentation
What problem does it solve?
Kubernetes clusters can experience GPU workload issues when a node's NVIDIA driver, GPU exposure, or device plugin health is misaligned. This Skill helps operators determine whether NVIDIA GPUs are being exposed correctly and whether the driver and device plugin are functioning, enabling faster diagnosis.
Core Features & Use Cases
- Node readiness verification: confirm the target node is Ready and has GPU capacity available.
- GPU exposure validation: verify that GPUs are visible to the scheduler and that the NVIDIA device plugin is registered.
- Guided troubleshooting workflow: provide steps to read driver versions and check device-plugin health across cluster nodes.
Quick Start
Identify whether a Kubernetes node is Ready, GPUs are exposed, and the NVIDIA device plugin is healthy on the target node.
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: k8s-nvidia-node-troubleshooting Download link: https://github.com/HugoAlmeidaMoreira/zeus-agent/archive/main.zip#k8s-nvidia-node-troubleshooting Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.