run-on-slurm

Author: yo-steven

Community

Launch Megatron-LM multi-node jobs on SLURM

Software Engineering #slurm #sbatch #megatron-lm #torch distributed run #multi-node training #nccl debugging #cuda_device_max_connections

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill helps you run distributed Megatron-LM training reliably across multiple GPUs and nodes on a SLURM cluster. It targets the usual failure points: misconfigured environment variables, wrong device-connection settings, and missing logging or diagnostics.

Core Features & Use Cases

SLURM job skeleton for multi-node training: Provides a minimal sbatch pattern that computes MASTER_ADDR and WORLD_SIZE and launches srun with one task per node.
Correct torch.distributed.run wiring: Sets nnodes, nproc-per-node, node-rank, master-addr, and master-port so that all ranks rendezvous correctly.
CUDA_DEVICE_MAX_CONNECTIONS rules by hardware/parallelism: Prevents known failures by advising when to set it to 1, when to avoid it, and when to use 32 for MoE communication overlap.
Failure diagnosis guidance: Explains how to inspect rank-by-rank stderr and classify failures as OOM, shape/divisibility errors, import errors, or NCCL failures.
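
The torch.distributed.run wiring described above can be sketched in Python as follows. This is a minimal illustration, not the Skill's actual skeleton. It assumes that the sbatch script has already exported MASTER_ADDR, and that SLURM_NNODES and SLURM_NODEID are set by srun with one task per node. The function name build_torchrun_command and the example values for the port and the training script are hypothetical.

```python
import sys


def build_torchrun_command(env, gpus_per_node, master_port, train_script):
    """Build a torch.distributed.run command line for one node of a SLURM job.

    `env` is a mapping such as os.environ. MASTER_ADDR is assumed to have been
    exported by the sbatch script (this sketch does not compute it).
    """
    nnodes = int(env["SLURM_NNODES"])      # number of nodes in the allocation
    node_rank = int(env["SLURM_NODEID"])   # index of this node, starting at 0
    master_addr = env["MASTER_ADDR"]       # host where rank 0 listens

    # Total number of ranks across the job: one process per GPU on every node.
    world_size = nnodes * gpus_per_node

    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={gpus_per_node}",
        f"--node-rank={node_rank}",
        f"--master-addr={master_addr}",
        f"--master-port={master_port}",
        train_script,
    ]
    return cmd, world_size
```

Each srun task would call this with os.environ and then execute the returned command, for example through subprocess.run(cmd, check=True). Every node must use the same master address and port, and a distinct node rank, for the ranks to rendezvous.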

Quick Start

Save the provided sbatch skeleton as run_megatron.slurm, then submit it by running sbatch --parsable run_megatron.slurm from the shared Megatron worktree.
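
If the submission is scripted, the job ID can be captured from the --parsable output. The sketch below assumes sbatch is on PATH. The helper names and the worktree argument are hypothetical and are not part of the Skill. With --parsable, sbatch prints only the job ID, followed by a semicolon and the cluster name when one is present.

```python
import subprocess


def parse_sbatch_parsable(output):
    """Split `sbatch --parsable` output into (job_id, cluster_or_None)."""
    job_id, _, cluster = output.strip().partition(";")
    return int(job_id), (cluster or None)


def submit(script="run_megatron.slurm", worktree="."):
    """Submit the job from the given worktree and return the parsed job ID."""
    result = subprocess.run(
        ["sbatch", "--parsable", script],
        cwd=worktree,          # run from the shared Megatron worktree
        check=True,            # raise if sbatch rejects the submission
        capture_output=True,
        text=True,
    )
    return parse_sbatch_parsable(result.stdout)
```

The returned job ID can then be passed to tools such as squeue or sacct to follow the job.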
