snellius-supercomputer
Community
Run GPU jobs reliably on Snellius
Author: dongzhuoyao
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Provides concise, operational guidance for submitting and troubleshooting GPU and CPU workloads on the SURF Snellius supercomputer, reducing time lost to environment misconfiguration, incorrect SLURM flags, and storage quota surprises.
Core Features & Use Cases
- Environment setup: Clear patterns for conda and venv initialization and when to use environment modules to ensure CUDA and Python compatibility.
- Job submission & partitioning: Partition selection guidance, single-node sbatch templates, interactive srun examples, and account and quota reminders to avoid rejected jobs.
- Monitoring & triage: Common monitoring commands, failure causes (CUDA availability, OOM, permission issues), and recommended fixes to recover or resubmit jobs.
- Use case: Prepare a reproducible training run by copying data to scratch, activating a conda environment, selecting the gpu_h100 partition for large models, and submitting an sbatch script with the correct --account.
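The use case above can be sketched as a single-node job script. This is a minimal illustration, not the skill's own template: the account `my_project`, the environment `myenv`, the conda path, the dataset path, `train.py` with its `--data-dir` flag, the module names `2023` and `CUDA/12.1.1`, and the resource sizes are all assumptions to replace with your own values. Only the `gpu_h100` partition and the need for `--account` come from the text.

```shell
#!/bin/bash
# Write a single-node GPU job script to train_job.sh (a hypothetical file name).
cat > train_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu_h100
#SBATCH --account=my_project
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --output=slurm-%j.out

# Stop at the first failing command so errors surface in the job log.
set -e

# Assumed module names; check `module avail` on the cluster.
module purge
module load 2023
module load CUDA/12.1.1

# Assumed conda location and environment name.
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate myenv

# Stage the dataset on scratch (assumes $TMPDIR is job scratch space).
cp -r "$HOME/data/my_dataset" "$TMPDIR/"

# train.py and --data-dir are hypothetical.
python train.py --data-dir "$TMPDIR/my_dataset"
EOF

# Check the syntax locally before submitting with: sbatch train_job.sh
bash -n train_job.sh && echo "train_job.sh: syntax OK"
```

The `#SBATCH` lines carry no trailing comments and sit before the first command, since sbatch stops reading directives once it reaches an executable line.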
Quick Start
Connect to snellius.surf.nl, load the appropriate module year and CUDA module, source conda or activate your venv, and submit your training script with sbatch using the correct --account and partition.
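As a rough sketch of the submission step, the snippet below is meant to run on a login node after connecting with `ssh <user>@snellius.surf.nl`. The account `my_project` and the job script `train_job.sh` are hypothetical placeholders, and the snippet assumes the job script itself loads the modules and activates the environment. Where `sbatch` is not installed, it only prints the command it would run.

```shell
#!/bin/bash
# ACCOUNT and JOB_SCRIPT are hypothetical placeholders; replace with your own.
ACCOUNT="my_project"
PARTITION="gpu_h100"
JOB_SCRIPT="train_job.sh"

# Options given on the command line override #SBATCH lines in the script.
submit_cmd=(sbatch --parsable --account="$ACCOUNT" --partition="$PARTITION" "$JOB_SCRIPT")

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    job_id="$("${submit_cmd[@]}")"
    job_id="${job_id%%;*}"
    echo "Submitted job ${job_id}"
    # Show the queue state of the job just submitted.
    squeue -j "${job_id}"
else
    echo "sbatch not found; would run: ${submit_cmd[*]}"
fi
```

Keeping the account and partition in variables makes it easy to resubmit the same script to a different partition without editing its `#SBATCH` header.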
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: snellius-supercomputer
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#snellius-supercomputer
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.