ml-profile

Community

Pinpoint ML training bottlenecks and improve GPU utilization.

Author: nishide-dev
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you identify why training is slow or unstable by profiling where time and memory are going, then guiding targeted performance fixes.

Core Features & Use Cases

  • Training performance profiling: Run PyTorch Lightning profilers (simple, advanced, and PyTorch profiler) for operator-level insight and trace visualization in TensorBoard.
  • Data loading diagnostics: Measure DataLoader throughput and find an optimal num_workers to reduce input bottlenecks.
  • GPU utilization and memory checks: Monitor GPU utilization and GPU memory usage to detect compute starvation, inefficient data pipelines, and OOM risks.
  • Use case: When a training run shows low GPU utilization and long iteration gaps, profile the DataLoader and training steps, then adjust worker counts, preprocessing, and batch/memory strategies based on the results.
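The DataLoader diagnostic described above can be approximated with a small timing harness. This is a minimal sketch, not the Skill's own script: the helper names (batches_per_second, sweep_num_workers, best_num_workers) and the candidate worker counts are assumptions made for illustration. It times any iterable of batches, so it can be tried without a GPU.

```python
import time
from typing import Callable, Dict, Iterable, Sequence


def batches_per_second(loader: Iterable, max_batches: int = 50) -> float:
    """Iterate up to max_batches batches and return the observed batches/second."""
    start = time.perf_counter()
    seen = 0
    for _ in loader:
        seen += 1
        if seen >= max_batches:
            break
    elapsed = time.perf_counter() - start
    if seen == 0:
        return 0.0  # empty loader: nothing to measure
    return seen / elapsed if elapsed > 0 else float("inf")


def sweep_num_workers(
    make_loader: Callable[[int], Iterable],
    candidates: Sequence[int] = (0, 2, 4, 8),  # assumed candidates, tune per machine
    max_batches: int = 50,
) -> Dict[int, float]:
    """Measure throughput once for each candidate worker count."""
    return {n: batches_per_second(make_loader(n), max_batches) for n in candidates}


def best_num_workers(results: Dict[int, float]) -> int:
    """Pick the worker count with the highest throughput; ties go to fewer workers."""
    return max(results, key=lambda n: (results[n], -n))


# Usage with PyTorch (assumes `dataset` is an existing torch Dataset):
#   from torch.utils.data import DataLoader
#   results = sweep_num_workers(
#       lambda n: DataLoader(dataset, batch_size=64, num_workers=n)
#   )
#   print(results, "->", best_num_workers(results))
```

A single pass per setting is noisy, since worker startup time and OS file caching both affect the first measurement. Repeating the sweep and comparing the results is usually worthwhile before changing num_workers.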

Quick Start

Use ml-profile when you notice low GPU utilization or OOM symptoms: ask for a short profiling run, then interpret the resulting traces in TensorBoard.
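For reference, a short profiling run with PyTorch Lightning might be configured roughly as follows. This is a sketch under stated assumptions: short_profiling_run_kwargs is a hypothetical helper, not part of ml-profile or Lightning, and the step count is arbitrary. It only builds keyword arguments for Lightning's Trainer, using the profiler names "simple", "advanced", and "pytorch" that the Trainer accepts as strings.

```python
from typing import Any, Dict

# Profiler names that Lightning's Trainer(profiler=...) accepts as strings.
PROFILER_CHOICES = ("simple", "advanced", "pytorch")


def short_profiling_run_kwargs(profiler: str = "simple", max_steps: int = 100) -> Dict[str, Any]:
    """Build Trainer keyword arguments for a brief profiling run.

    Hypothetical helper for illustration only.
    """
    if profiler not in PROFILER_CHOICES:
        raise ValueError(f"profiler must be one of {PROFILER_CHOICES}, got {profiler!r}")
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    return {
        "profiler": profiler,           # which Lightning profiler to attach
        "max_steps": max_steps,         # stop early: a profile needs only a few steps
        "limit_val_batches": 0,         # skip validation so the report covers training only
        "enable_checkpointing": False,  # avoid checkpoint I/O skewing the timings
    }


# Usage (assumes `model` and `dm` are an existing LightningModule and DataModule):
#   import pytorch_lightning as pl
#   trainer = pl.Trainer(**short_profiling_run_kwargs("pytorch", max_steps=50))
#   trainer.fit(model, datamodule=dm)
```

The "simple" profiler prints a per-hook timing summary when training ends, while "pytorch" records operator-level traces. Where the trace files are written depends on the Lightning version and logger settings, so check the run's log directory before pointing TensorBoard at it.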

Dependency Matrix

Required Modules

matplotlib, hydra-core, pytorch-lightning, torch

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: ml-profile
Download link: https://github.com/nishide-dev/claude-code-ml-research/archive/main.zip#ml-profile

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
