lumi-supercomputer

Community

Run and debug GPU ML jobs on LUMI.

Author: dongzhuoyao
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Deploying and debugging GPU machine learning workloads on the LUMI supercomputer requires many system-specific adjustments that differ from typical NVIDIA clusters, including ROCm compatibility, Singularity container workflows, Lustre filesystem constraints, and network/CPU affinity tuning. This Skill collects the pragmatic defaults, environment variables, Slurm templates, and storage guidance needed to avoid hangs, poor performance, and failed jobs on LUMI.

Core Features & Use Cases

  • Slurm Job Templates: Ready single-node and multi-node sbatch templates with the correct account, partition, GPU/GCD counts, and CPU allocations, plus safe flags such as --no-requeue and --open-mode.
  • ROCm & Container Guidance: Clear instructions for using Singularity containers, recommended module loads, how to extend a container with user-installed packages and pack them into a SquashFS image, and the ROCm-specific considerations that replace CUDA ones.
  • Runtime Environment & Debugging: Explicit environment variables and mitigations for MIOpen cache on Lustre, NCCL/RCCL network settings for Slingshot, ROCR_VISIBLE_DEVICES mapping, and CPU-GPU affinity masks.
  • Storage and I/O Best Practices: Recommendations for using project/scratch/flash and SquashFS to avoid metadata storms on Lustre.
  • Use Case: Preparing a reproducible, high-performance PyTorch distributed training job on LUMI-G that avoids MIOpen hangs, ensures proper inter-node communication, and respects billing/accounting rules.
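
The runtime settings listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the Skill's own code: the GCD-to-core mapping, the Slingshot interface names (hsn0 to hsn3), the NCCL_NET_GDR_LEVEL value, and the /tmp cache path are assumptions recalled from the public LUMI documentation and should be checked against the current docs. The helper names cpu_bind_masks and runtime_env are hypothetical.

```python
from __future__ import annotations

# Assumption: first core of the CCD closest to each GCD (index = GCD number)
# on a LUMI-G node. Verify against the current LUMI documentation.
GCD_TO_FIRST_CORE = [48, 56, 16, 24, 0, 8, 32, 40]


def cpu_bind_masks(cores_per_gcd: int = 7) -> list[str]:
    """Return one hex CPU mask per GCD, skipping the first core of each CCD."""
    masks = []
    for first_core in GCD_TO_FIRST_CORE:
        # Set bits for cores first_core+1 .. first_core+cores_per_gcd.
        bits = ((1 << cores_per_gcd) - 1) << (first_core + 1)
        masks.append(hex(bits))
    return masks


def runtime_env(user: str, node_id: int, local_id: int) -> dict[str, str]:
    """Build per-rank variables: node-local MIOpen cache, network, GPU mapping."""
    cache = f"/tmp/{user}-miopen-cache-{node_id}"  # keep the cache off Lustre
    return {
        "MIOPEN_USER_DB_PATH": cache,
        "MIOPEN_CUSTOM_CACHE_DIR": cache,
        "NCCL_SOCKET_IFNAME": "hsn0,hsn1,hsn2,hsn3",  # assumed Slingshot NICs
        "NCCL_NET_GDR_LEVEL": "PHB",                  # assumed value
        "ROCR_VISIBLE_DEVICES": str(local_id),        # one GCD per local rank
    }


if __name__ == "__main__":
    # The joined masks are the kind of value passed to srun --cpu-bind=mask_cpu:...
    print("mask_cpu:" + ",".join(cpu_bind_masks()))
    print(runtime_env("alice", node_id=0, local_id=3))
```

The masks leave the first core of every CCD free, which matches a 7-cores-per-task allocation; if your allocation uses a different core count, pass it as cores_per_gcd.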

Quick Start

Use the lumi-supercomputer skill to generate a ready-to-run Slurm sbatch script and environment checklist for training a PyTorch model on LUMI with ROCm, Singularity, NCCL/RCCL settings, CPU affinity masks, and MIOpen cache redirection.
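
As a rough idea of what such a generated script might look like, the sketch below renders a minimal sbatch file from Python. It is not the Skill's actual template: the function name render_sbatch, the partition default (standard-g), the container file name, the training script name, and the placeholder account are all hypothetical and need to be replaced with your project's values.

```python
def render_sbatch(account: str, nodes: int = 1, partition: str = "standard-g",
                  time_limit: str = "01:00:00",
                  container: str = "pytorch_rocm.sif",  # hypothetical image name
                  script: str = "train.py") -> str:      # hypothetical entry point
    """Render a minimal sbatch script using one task per GCD (8 per node)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --gpus-per-node=8",
        "#SBATCH --ntasks-per-node=8",
        "#SBATCH --cpus-per-task=7",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --no-requeue",        # do not restart automatically after a failure
        "#SBATCH --open-mode=append",  # append to existing log files
        "",
        f"srun singularity exec {container} python {script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder account; substitute your own project ID.
    print(render_sbatch("project_XXXXXXXXX", nodes=2))
```

A real job would also export the MIOpen cache, RCCL network, and CPU binding settings before the srun line; they are left out here to keep the sketch short.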

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: lumi-supercomputer
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#lumi-supercomputer

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
