mcore-run-on-slurm
Community
Launch multi-node Megatron-LM on SLURM clusters.
Author: sayalinvidia
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Launch and manage distributed Megatron-LM training on SLURM clusters, providing a repeatable pattern that reduces setup time and configuration errors.
Core Features & Use Cases
- Reusable sbatch skeleton for multi-node GPU training with Megatron-LM.
- Environment and launcher guidance for torch.distributed.run, including MASTER_ADDR/MASTER_PORT and WORLD_SIZE calculations.
- CUDA_DEVICE_MAX_CONNECTIONS recommendations across hardware generations and parallelism modes.
- Container deployment considerations and monitoring guidance.
- Use Case: Validate a new SLURM cluster by quickly spinning up a Megatron-LM job to test scaling and fault diagnosis.
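The MASTER_ADDR/MASTER_PORT and WORLD_SIZE values mentioned in the list above can be derived from the SLURM allocation. The following is a minimal sketch in Python, not the skill's own code. It assumes a fixed GPU count per node and that the first host in the allocation acts as the rendezvous host. The names `LaunchEnv`, `compute_launch_env`, and `data_parallel_size` are hypothetical, and the data-parallel calculation ignores context and expert parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchEnv:
    """Rendezvous settings shared by every rank (hypothetical helper type)."""
    master_addr: str
    master_port: int
    world_size: int


def compute_launch_env(hostnames, gpus_per_node, master_port=29500):
    """Derive rendezvous settings from an ordered list of allocated hosts.

    hostnames: expanded host list, for example the output of
               `scontrol show hostnames "$SLURM_JOB_NODELIST"`.
    master_port: 29500 is the PyTorch default; any free port works.
    """
    if not hostnames:
        raise ValueError("at least one host is required")
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be positive")
    return LaunchEnv(
        master_addr=hostnames[0],  # assumption: first host is the master
        master_port=master_port,
        world_size=len(hostnames) * gpus_per_node,
    )


def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Data-parallel degree implied by the world size and TP/PP degrees.

    Assumption: context and expert parallelism are not used.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    return world_size // model_parallel
```

For two 8-GPU nodes, `compute_launch_env(["node01", "node02"], 8)` gives a world size of 16, and `data_parallel_size(16, 4, 2)` leaves 2 data-parallel replicas.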
Quick Start
Create and submit a SLURM sbatch script using the provided skeleton to start a multi-node Megatron-LM training job.
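The skeleton itself is not reproduced on this page, so the following is only an illustrative sketch of what such a script might look like, generated from Python. It assumes one `torch.distributed.run` launcher per node, a cluster that accepts `--gpus-per-node` (some sites use `--gres=gpu:N` instead), and a training command without single quotes. The function name `render_sbatch` and the job name are hypothetical, and Megatron-LM training arguments are intentionally left out.

```python
def render_sbatch(job_name, nodes, gpus_per_node, train_cmd,
                  time_limit="01:00:00", master_port=29500):
    """Return the text of an sbatch script that starts one launcher per node.

    Assumption: train_cmd contains no single quotes, because it is embedded
    in a single-quoted `bash -c` string.
    """
    launcher = (
        "python -m torch.distributed.run"
        " --nnodes=$SLURM_NNODES"
        f" --nproc_per_node={gpus_per_node}"
        " --node_rank=$SLURM_NODEID"
        " --master_addr=$MASTER_ADDR"
        " --master_port=$MASTER_PORT"
        f" {train_cmd}"
    )
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        "set -euo pipefail",
        "",
        "# First host in the allocation acts as the rendezvous host.",
        'export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)',
        f"export MASTER_PORT={master_port}",
        "",
        "# Single quotes defer $SLURM_NODEID expansion to each srun task.",
        f"srun bash -c '{launcher}'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a two-node, 8-GPU-per-node script; redirect to a file and sbatch it.
    print(render_sbatch("mcore-smoke", 2, 8, "pretrain_gpt.py"), end="")
```

Running the module prints a script that can be saved to a file and submitted with `sbatch`. Keeping one task per node and letting the launcher spawn the per-GPU workers keeps the rank bookkeeping in one place.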
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: let Claude install the skill automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: mcore-run-on-slurm
Download link: https://github.com/sayalinvidia/sayali-skills-test/archive/main.zip#mcore-run-on-slurm
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.