gpu-train-monitor

Community

Real-time multi-GPU training monitoring.

AuthorGonglitian
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This Skill helps ML engineers and AI researchers quickly assess and diagnose multi-GPU training jobs by aggregating GPU status, training progress, and log data into actionable insights.

Core Features & Use Cases

  • GPU status & utilization: snapshot GPUs' usage, memory, temperature to prevent bottlenecks.
  • Training progress & logs: parse active runs to extract epoch/step, loss trends, learning rate, and throughput.
  • Integrated monitoring: support WandB integration detection and produce concise status reports for team standups.
  • Use Case: A data scientist monitors a multi-node training job and receives a summary when the job stalls or high memory usage is detected.

Quick Start

Ask the skill to generate a real-time status report for your current multi-GPU training job.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: gpu-train-monitor
Download link: https://github.com/Gonglitian/agent-skills/archive/main.zip#gpu-train-monitor

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.