slurm-self-healing-watchdog
CommunityWatch Slurm jobs, repair failures, retry safely
AuthorKwongFuk
Version1.0.0
Installs0
System Documentation
What problem does it solve?
Slurm job workflows can stall or fail silently, requiring manual monitoring. This skill provides an automated watchdog that submits sbatch jobs, watches terminal states via sacct/squeue, diagnoses failures from logs, and triggers minimal, local repairs before resubmitting up to a configurable limit.
Core Features & Use Cases
- Submits an sbatch script and continuously monitors its status until terminal.
- Diagnoses failures using Slurm state and log output, then applies a minimal repair workflow.
- Automatically retries failed jobs up to a configurable limit and emits per-attempt JSON reports.
- Ideal for long-running pipelines and experiments where manual re-submission is error-prone.
Quick Start
From the target repository root, run the watchdog with the sbatch script path and optional --max-attempts to start automatic monitoring and safe repairs.
Dependency Matrix
Required Modules
None requiredComponents
scripts
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: slurm-self-healing-watchdog Download link: https://github.com/KwongFuk/codex-skills/archive/main.zip#slurm-self-healing-watchdog Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.