sudcgm
CommunityStreamlined large-scale GPU management and monitoring.
Authordongg622
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill automates the centralized monitoring, diagnosis, and management of large GPU clusters, simplifying data center operations.
Core Features & Use Cases
- Device Discovery and Monitoring: Automatically find and monitor GPU health, performance, and configuration.
- Cluster Diagnostics and Fault Detection: Perform deep diagnostics to identify hardware issues and preempt failures.
- Management and Policy Enforcement: Create GPU groups, set policies, and collect performance metrics for scalable cluster management.
- Use Case: A data center engineer needs to regularly check GPU health across hundreds of servers, diagnose faults, and enforce power or temperature policies quickly using scripted commands.
Quick Start
Run the discovery command to list GPUs, then perform health checks and diagnostics on selected devices.
Dependency Matrix
Required Modules
brsmisuvs
Components
scriptsreferences
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: sudcgm Download link: https://github.com/dongg622/china-ai-chip-skill/archive/main.zip#sudcgm Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.