snellius-supercomputer


Run GPU jobs reliably on Snellius

Author: dongzhuoyao
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Provides concise, operational guidance for submitting and troubleshooting GPU and CPU workloads on the SURF Snellius supercomputer, reducing time lost to environment misconfiguration, incorrect SLURM flags, and storage quota surprises.

Core Features & Use Cases

  • Environment setup: Clear patterns for conda and venv initialization, plus guidance on when to use environment modules to keep CUDA and Python versions compatible.
  • Job submission & partitioning: Partition selection guidance, single-node sbatch templates, interactive srun examples, and account and quota reminders to avoid rejected jobs.
  • Monitoring & triage: Common monitoring commands, failure causes (CUDA availability, OOM, permission issues), and recommended fixes to recover or resubmit jobs.
  • Use case: Prepare a reproducible training run by copying data to scratch, activating a conda environment, selecting the gpu_h100 partition for large models, and submitting an sbatch script with the correct --account.
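The failure causes named under monitoring and triage (CUDA availability, OOM, permission issues) can be spotted mechanically in a job log. The following is a minimal sketch, not part of the skill itself: the function name `classify_failure`, the labels, and the log patterns are illustrative assumptions, and real messages vary by framework and SLURM version.

```python
import re

# Illustrative patterns only (assumptions): adjust them to the messages
# your framework and SLURM version actually print.
FAILURE_PATTERNS = [
    ("cuda_unavailable",
     re.compile(r"no CUDA-capable device is detected|Found no NVIDIA driver", re.I)),
    ("gpu_oom", re.compile(r"CUDA out of memory", re.I)),
    ("host_oom", re.compile(r"oom[_-]kill", re.I)),
    ("permission", re.compile(r"Permission denied", re.I)),
]


def classify_failure(log_text: str) -> list[str]:
    """Return the label of every failure pattern found in a job log."""
    return [label for label, pattern in FAILURE_PATTERNS
            if pattern.search(log_text)]
```

Feeding it the contents of a job's output file gives a short list of labels to act on, for example lowering the batch size for `gpu_oom` or checking file ownership on scratch for `permission`.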

Quick Start

Connect to snellius.surf.nl, load the appropriate module year and CUDA module, source conda or activate your venv, and submit your training script with sbatch using the correct --account and partition.
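The steps above can be sketched as a small Python helper that renders a single-node sbatch script and reads the job ID back from the sbatch output. This is a sketch under stated assumptions: the module year `2023`, the module name `CUDA/12.1.1`, the environment name `myenv`, the command `python train.py`, and the account value are placeholders, and the script assumes `conda` is already on the PATH. Check `module avail` and your own project account before using it.

```python
import re


def render_sbatch(account, partition="gpu_h100", gpus=1,
                  time_limit="01:00:00", job_name="train",
                  module_year="2023", cuda_module="CUDA/12.1.1",
                  conda_env="myenv", command="python train.py"):
    """Render a single-node GPU sbatch script as a string.

    All defaults are placeholders (assumptions), not verified Snellius values.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --nodes=1",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job ID
        "",
        f"module load {module_year}",
        f"module load {cuda_module}",
        "",
        "# Assumes conda is on PATH; swap for your venv activation if needed.",
        'source "$(conda info --base)/etc/profile.d/conda.sh"',
        f"conda activate {conda_env}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else None
```

A typical use would be to write `render_sbatch(account="my_account")` to a file such as `job.sh`, run `sbatch job.sh` on a login node, and pass the printed line to `parse_job_id` so the ID can be used with `squeue` or `scancel` later.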

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install the skill automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: snellius-supercomputer
Download link: https://github.com/dongzhuoyao/tao-research-skills/archive/main.zip#snellius-supercomputer

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
