mcore-run-on-slurm

Launch multi-node Megatron-LM on SLURM clusters.

Author: sayalinvidia
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Setting up distributed Megatron-LM training on a SLURM cluster is slow and error-prone. This skill provides a repeatable pattern for launching and managing multi-node jobs, which reduces setup time and configuration errors.

Core Features & Use Cases

  • Reusable sbatch skeleton for multi-node GPU training with Megatron-LM.
  • Environment and launcher guidance for torch.distributed.run, including MASTER_ADDR/MASTER_PORT and WORLD_SIZE calculations.
  • CUDA_DEVICE_MAX_CONNECTIONS recommendations across hardware generations and parallelism modes.
  • Container deployment considerations and monitoring guidance.
  • Use Case: Validate a new SLURM cluster by quickly launching a Megatron-LM job to test scaling and diagnose faults.
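
To make the MASTER_ADDR/MASTER_PORT and WORLD_SIZE calculation concrete, here is a minimal Python sketch. It is not taken from the skill itself. The function name rendezvous_settings is hypothetical, the host list is assumed to be the expanded node list of the job (for example, the output of `scontrol show hostnames`), and port 29500 is only an assumed default that should be free on the first node.

```python
def rendezvous_settings(hostnames, gpus_per_node, master_port=29500):
    """Derive rendezvous values for a multi-node job (illustrative helper).

    hostnames: expanded list of node names allocated to the job.
    gpus_per_node: number of GPU worker processes started on each node.
    master_port: assumed free TCP port on the first node.
    """
    if not hostnames:
        raise ValueError("hostnames must contain at least one node")
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")

    nnodes = len(hostnames)
    return {
        # By convention the first node in the allocation hosts the rendezvous.
        "MASTER_ADDR": hostnames[0],
        "MASTER_PORT": str(master_port),
        "NNODES": str(nnodes),
        # One rank per GPU, so the world size is nodes times GPUs per node.
        "WORLD_SIZE": str(nnodes * gpus_per_node),
    }


settings = rendezvous_settings(["node001", "node002"], gpus_per_node=8)
print(settings["WORLD_SIZE"])  # 2 nodes x 8 GPUs = 16 ranks
```

Returning strings keeps the values ready to export as environment variables before the launcher starts.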

Quick Start

Create and submit a SLURM sbatch script using the provided skeleton to start a multi-node Megatron-LM training job.
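
The skill's own skeleton is not reproduced on this page, so the following is only a sketch of what such an sbatch script might look like, generated from Python. The helper name render_sbatch, the job parameters, and the training command are all hypothetical. It assumes one launcher task per node, with torch.distributed.run spawning one process per GPU, and it assumes the training command contains no single quotes.

```python
def render_sbatch(job_name, nodes, gpus_per_node, train_cmd,
                  time_limit="01:00:00", master_port=29500):
    """Render a minimal multi-node sbatch script as text (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # One launcher task per node; the launcher forks the per-GPU workers.
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "set -eu",
        # The first host in the allocation acts as the rendezvous endpoint.
        'export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)',
        f"export MASTER_PORT={master_port}",
        "",
        # Single quotes defer $SLURM_NODEID expansion to each node's own shell.
        "srun bash -c 'python -m torch.distributed.run \\",
        f"  --nnodes={nodes} --nproc_per_node={gpus_per_node} \\",
        "  --node_rank=$SLURM_NODEID \\",
        "  --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \\",
        f"  {train_cmd}'",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 2 nodes, 8 GPUs each; the script name and its arguments
# are placeholders for a real Megatron-LM training invocation.
script = render_sbatch("mcore-test", 2, 8, "pretrain_gpt.py --your-args-here")
with open("job.sbatch", "w") as handle:
    handle.write(script)
```

The resulting file would then be submitted with `sbatch job.sbatch`. Partition, account, and container options are site-specific and are deliberately left out of this sketch.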

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: mcore-run-on-slurm
Download link: https://github.com/sayalinvidia/sayali-skills-test/archive/main.zip#mcore-run-on-slurm

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.