gpu-training-acceleration

Community

Maximize PyTorch GPU throughput and memory.

Authordongzhuoyao
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Optimize PyTorch training speed and reduce GPU memory usage on CUDA GPUs by providing actionable, config-gated acceleration patterns, mixed precision guidance, fused optimizer kernels, Triton kernel fusion, and monitoring practices. It helps diagnose low GPU utilization, avoid out-of-memory failures, and safely adopt new compiler or kernel features without destabilizing training runs.

Core Features & Use Cases

  • Config-gated acceleration: Patterns to gate features like torch.compile, fused optimizers, and gradient checkpointing behind configuration so they can be toggled and logged.
  • Precision and kernel guidance: Advice for TF32, bf16/fp16 mixed precision, cuDNN autotuning, and when to prefer fused kernels or Triton implementations.
  • Failure handling and telemetry: Safe defaults, try/except fallbacks for torch.compile, and recommendations to always log acceleration state for reproducibility and debugging.
  • Memory and dataflow patterns: Strategies for latent-space training, contiguous memory enforcement, empty_cache placement, and NVCC build flags for custom CUDA extensions.
  • Use Case: Speed up transformer or generative model training on multi-GPU clusters by enabling TF32, using bf16 mixed precision where safe, compiling stable decoder submodules, and switching to fused AdamW for optimizer speedups.

Quick Start

Run a short fastrun with TF32 enabled, mixed precision set to bf16, fused optimizer turned on, and compile only stable-shape decoder submodules to validate throughput and memory before full-scale training.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: gpu-training-acceleration
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#gpu-training-acceleration

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.