nemo-mbridge-perf-moe-optimization-workflow


MoE throughput tuning workflow for Megatron Bridge

Author: sayalinvidia
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This workflow provides a structured approach to diagnosing and improving throughput for Mixture-of-Experts (MoE) training on Megatron Bridge. It addresses three classes of bottleneck: memory, communication, and compute.

Core Features & Use Cases

  • Phase-driven optimization using the Three Walls framework to guide memory, communication, and compute improvements.
  • Parallel Folding decouples the parallelism layout of attention layers from that of MoE layers, so each can be sized independently in multi-GPU configurations.
  • Dispatcher selection, FP8 mapping guidance, and CUDA graph considerations to accelerate MoE workloads across hardware (e.g., Hopper and Blackwell).
  • Use cases include diagnosing throughput regressions after commits and performing end-to-end MoE throughput tuning sweeps.
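
To make the Parallel Folding bullet concrete, here is a minimal sketch of the bookkeeping involved. It assumes the usual Megatron-Core convention that attention layers are laid out as TP × CP × DP × PP and MoE layers as ETP × EP × EDP × PP over the same set of GPUs, with the pipeline dimension shared. The `FoldedLayout` class and its field names are hypothetical and are not part of Megatron Bridge; the sketch only checks that a candidate mesh divides the world size and derives the two data-parallel sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldedLayout:
    """Hypothetical helper for sanity-checking a folded mesh (not a Megatron Bridge API)."""

    world_size: int  # total number of GPUs
    tp: int          # tensor parallel size for attention layers
    cp: int          # context parallel size for attention layers
    pp: int          # pipeline parallel size (assumed shared by both layouts)
    etp: int         # expert tensor parallel size for MoE layers
    ep: int          # expert parallel size for MoE layers

    def _dp(self, model_parallel: int, label: str) -> int:
        # The data-parallel size is whatever remains after model parallelism.
        if model_parallel <= 0 or self.world_size % model_parallel != 0:
            raise ValueError(
                f"{label} layout ({model_parallel} GPUs per replica) "
                f"does not divide world_size={self.world_size}"
            )
        return self.world_size // model_parallel

    def attention_dp(self) -> int:
        # Attention side: world_size = TP * CP * DP * PP
        return self._dp(self.tp * self.cp * self.pp, "attention")

    def expert_dp(self) -> int:
        # MoE side: world_size = ETP * EP * EDP * PP
        return self._dp(self.etp * self.ep * self.pp, "MoE")


# Example: 64 GPUs, TP=4 for attention, but ETP=1 and EP=8 for the experts.
layout = FoldedLayout(world_size=64, tp=4, cp=1, pp=2, etp=1, ep=8)
print(layout.attention_dp(), layout.expert_dp())  # prints "8 4"
```

In this example the attention layers see 8 data-parallel replicas while the expert layers see 4, which is the kind of independent sizing the bullet describes. The specific sizes are illustrative only, not a recommended configuration.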

Quick Start

Start the optimization by guiding the agent through the fit, scale, profile, and retune steps, using the Parallel Folding meshes and the recommended dispatcher and FP8 mappings.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: nemo-mbridge-perf-moe-optimization-workflow
Download link: https://github.com/sayalinvidia/sayali-skills-test/archive/main.zip#nemo-mbridge-perf-moe-optimization-workflow

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
