lumi-supercomputer
Run and debug GPU ML jobs on LUMI.
System Documentation
What problem does it solve?
Deploying and debugging GPU machine learning workloads on the LUMI supercomputer requires many system-specific adjustments that differ from typical NVIDIA clusters, including ROCm compatibility, Singularity container workflows, Lustre filesystem constraints, and network/CPU affinity tuning. This Skill collects the pragmatic defaults, environment variables, Slurm templates, and storage guidance needed to avoid hangs, poor performance, and failed jobs on LUMI.
Core Features & Use Cases
- Slurm Job Templates: Ready-to-use single-node and multi-node sbatch templates with the correct account, partition, GPU/GCD counts, and CPU allocations, plus safe flags such as --no-requeue and --open-mode.
- ROCm & Container Guidance: Clear instructions for using Singularity containers, recommended module loads, how to extend and squash user installs, and ROCm-specific considerations in place of CUDA.
- Runtime Environment & Debugging: Explicit environment variables and mitigations for MIOpen cache on Lustre, NCCL/RCCL network settings for Slingshot, ROCR_VISIBLE_DEVICES mapping, and CPU-GPU affinity masks.
- Storage and I/O Best Practices: Recommendations for using project/scratch/flash and SquashFS to avoid metadata storms on Lustre.
- Use Case: Preparing a reproducible, high-performance PyTorch distributed training job on LUMI-G that avoids MIOpen hangs, ensures proper inter-node communication, and respects billing/accounting rules.
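The CPU-GPU affinity masks listed under Runtime Environment & Debugging can be derived rather than memorised. The following is a minimal Python sketch, assuming a LUMI-G node with 64 cores arranged as eight 8-core CCDs, eight GCDs, and the first core of each CCD reserved for the system. The GCD-to-CCD order in GCD_TO_CCD and the helper names are assumptions made for illustration, not taken from this Skill, and should be checked against the current LUMI documentation before use.

```python
# Assumption: GCD i sits closest to CCD GCD_TO_CCD[i] on a LUMI-G node.
# Verify this ordering against the LUMI documentation before relying on it.
GCD_TO_CCD = [6, 7, 2, 3, 0, 1, 4, 5]
CORES_PER_CCD = 8  # assumption: 64 cores split into 8 CCDs


def gcd_cpu_mask(gcd: int, skip_first_core: bool = True) -> str:
    """Return a hex bitmask of the cores paired with one GCD."""
    first = GCD_TO_CCD[gcd] * CORES_PER_CCD
    start = first + 1 if skip_first_core else first
    mask = 0
    for core in range(start, first + CORES_PER_CCD):
        mask |= 1 << core  # set the bit for this core
    return hex(mask)


def cpu_bind_option() -> str:
    """Build an srun --cpu-bind option with one mask per local rank."""
    masks = ",".join(gcd_cpu_mask(g) for g in range(len(GCD_TO_CCD)))
    return f"--cpu-bind=mask_cpu:{masks}"


if __name__ == "__main__":
    print(cpu_bind_option())
```

The resulting string is meant to be passed to srun so that local rank i runs on the cores nearest to GCD i. If the node is not configured with a reserved core per CCD, calling gcd_cpu_mask with skip_first_core=False yields masks that cover all eight cores.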
Quick Start
Use the lumi-supercomputer skill to generate a ready-to-run Slurm sbatch script and environment checklist for training a PyTorch model on LUMI with ROCm, Singularity, NCCL/RCCL settings, CPU affinity masks, and MIOpen cache redirection.
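To give a sense of what such a generated script might contain, here is a small Python sketch that renders an sbatch file as text. It is not the Skill's actual template: the function name render_sbatch, the default partition standard-g, the container name train.sif, the command, and the specific environment values are all assumptions chosen for illustration and should be compared with the Skill's own output and the LUMI documentation.

```python
def render_sbatch(account: str, nodes: int = 1, partition: str = "standard-g",
                  container: str = "train.sif",
                  command: str = "python train.py",
                  time_limit: str = "01:00:00") -> str:
    """Return the text of an sbatch script (all defaults are illustrative)."""
    header = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --gpus-per-node=8",    # assumption: 8 GCDs per node
        "#SBATCH --ntasks-per-node=8",  # one task per GCD
        "#SBATCH --cpus-per-task=7",    # assumption: 7 usable cores per GCD
        f"#SBATCH --time={time_limit}",
        "#SBATCH --no-requeue",
        "#SBATCH --open-mode=append",
    ]
    env = [
        # Keep the MIOpen cache on node-local /tmp instead of Lustre.
        'export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"',
        'export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"',
        # Assumed Slingshot settings for RCCL; confirm the values for your setup.
        "export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3",
        "export NCCL_NET_GDR_LEVEL=3",
    ]
    launch = [f"srun singularity exec {container} {command}"]
    return "\n".join(header + [""] + env + [""] + launch) + "\n"


if __name__ == "__main__":
    # "project_XXXXXXXXX" is a placeholder, not a real account.
    print(render_sbatch("project_XXXXXXXXX", nodes=2))
```

Rendering the script from a function keeps the account, node count, and container path in one place, so single-node and multi-node variants differ only in their arguments.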
Dependency Matrix
Required Modules
None required
Components
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: lumi-supercomputer
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#lumi-supercomputer
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.