nemo-mbridge-perf-moe-optimization-workflow
Community
MoE throughput tuning workflow for Megatron Bridge
Author: sayalinvidia
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This workflow provides a structured approach to diagnose and improve throughput for MoE training on Megatron Bridge, addressing memory, communication, and compute bottlenecks.
Core Features & Use Cases
- Phase-driven optimization using the Three Walls framework to guide memory, communication, and compute improvements.
- Parallel Folding decouples attention and MoE parallelism, enabling scalable multi-GPU configurations.
- Guidance on dispatcher selection, FP8 mapping, and CUDA graph usage to accelerate MoE workloads across GPU architectures (e.g., Hopper and Blackwell).
- Use cases include diagnosing throughput regressions after commits and performing end-to-end MoE throughput tuning sweeps.
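To make the Parallel Folding bullet concrete, here is a minimal sketch of the bookkeeping it implies. It assumes the usual Megatron-Core convention that attention layers factor the world size as TP x CP x DP x PP while MoE layers factor the same world size as ETP x EP x EDP x PP, with pipeline parallelism shared between the two. The `FoldedLayout` class and its field names are hypothetical, written only for illustration; they are not part of Megatron Bridge or of this skill.

```python
from dataclasses import dataclass


def _exact_div(total: int, denom: int, label: str) -> int:
    """Divide total by denom, rejecting layouts that do not tile the GPUs evenly."""
    if denom <= 0 or total % denom != 0:
        raise ValueError(f"{label}: {total} GPUs not divisible by {denom}")
    return total // denom


@dataclass(frozen=True)
class FoldedLayout:
    """Hypothetical description of a parallel-folded MoE layout (illustration only)."""
    world_size: int  # total number of GPUs
    pp: int          # pipeline-parallel size, shared by attention and MoE layers
    tp: int          # tensor-parallel size for attention layers
    cp: int          # context-parallel size for attention layers
    etp: int         # expert tensor-parallel size for MoE layers
    ep: int          # expert-parallel size for MoE layers

    def attention_dp(self) -> int:
        # Data-parallel size left over for attention: world / (PP * TP * CP)
        return _exact_div(self.world_size, self.pp * self.tp * self.cp, "attention mesh")

    def expert_dp(self) -> int:
        # Data-parallel size left over for experts: world / (PP * ETP * EP)
        return _exact_div(self.world_size, self.pp * self.etp * self.ep, "expert mesh")


layout = FoldedLayout(world_size=64, pp=2, tp=4, cp=1, etp=1, ep=8)
print("attention DP:", layout.attention_dp())
print("expert DP:", layout.expert_dp())
```

Both meshes must cover the same set of GPUs, so a candidate configuration that raises `ValueError` here cannot be launched as written. Running this kind of check before a tuning sweep is a cheap way to discard invalid combinations of attention and expert parallelism.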
Quick Start
Start a three-phase MoE optimization by walking the agent through the fit, scale, profile, and retune steps, using the Parallel Folding meshes and the recommended dispatcher and FP8 mappings.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: nemo-mbridge-perf-moe-optimization-workflow Download link: https://github.com/sayalinvidia/sayali-skills-test/archive/main.zip#nemo-mbridge-perf-moe-optimization-workflow Please download this .zip file, extract it, and install it in the .claude/skills/ directory.