Name: gpu-train-monitor
Availability: InStock
Author: Gonglitian

System Documentation

What problem does it solve?

This Skill helps ML engineers and AI researchers quickly assess and diagnose multi-GPU training jobs by aggregating GPU status, training progress, and log data into actionable insights.

Core Features & Use Cases

GPU status & utilization: snapshot GPUs' usage, memory, temperature to prevent bottlenecks.
Training progress & logs: parse active runs to extract epoch/step, loss trends, learning rate, and throughput.
Integrated monitoring: support WandB integration detection and produce concise status reports for team standups.
Use Case: A data scientist monitors a multi-node training job and receives a summary when the job stalls or high memory usage is detected.

Quick Start

Ask the skill to generate a real-time status report for your current multi-GPU training job.

Please help me install this Skill: Name: gpu-train-monitor Download link: https://github.com/Gonglitian/agent-skills/archive/main.zip#gpu-train-monitor Please download this .zip file, extract it, and install it in the .claude/skills/ directory.

gpu-train-monitor

System Documentation

What problem does it solve?

Core Features & Use Cases

Quick Start

Dependency Matrix

Required Modules

Components

💻 Claude Code Installation

Agent Skills Search Helper