gpu-train-monitor
CommunityReal-time multi-GPU training monitoring.
AuthorGonglitian
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill helps ML engineers and AI researchers quickly assess and diagnose multi-GPU training jobs by aggregating GPU status, training progress, and log data into actionable insights.
Core Features & Use Cases
- GPU status & utilization: snapshot GPUs' usage, memory, temperature to prevent bottlenecks.
- Training progress & logs: parse active runs to extract epoch/step, loss trends, learning rate, and throughput.
- Integrated monitoring: support WandB integration detection and produce concise status reports for team standups.
- Use Case: A data scientist monitors a multi-node training job and receives a summary when the job stalls or high memory usage is detected.
Quick Start
Ask the skill to generate a real-time status report for your current multi-GPU training job.
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: gpu-train-monitor Download link: https://github.com/Gonglitian/agent-skills/archive/main.zip#gpu-train-monitor Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.