slurm-self-healing-watchdog

Community

Watch Slurm jobs, repair failures, retry safely

AuthorKwongFuk
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Slurm job workflows can stall or fail silently, requiring manual monitoring. This skill provides an automated watchdog that submits sbatch jobs, watches terminal states via sacct/squeue, diagnoses failures from logs, and triggers minimal, local repairs before resubmitting up to a configurable limit.

Core Features & Use Cases

  • Submits an sbatch script and continuously monitors its status until terminal.
  • Diagnoses failures using Slurm state and log output, then applies a minimal repair workflow.
  • Automatically retries failed jobs up to a configurable limit and emits per-attempt JSON reports.
  • Ideal for long-running pipelines and experiments where manual re-submission is error-prone.

Quick Start

From the target repository root, run the watchdog with the sbatch script path and optional --max-attempts to start automatic monitoring and safe repairs.

Dependency Matrix

Required Modules

None required

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: slurm-self-healing-watchdog
Download link: https://github.com/KwongFuk/codex-skills/archive/main.zip#slurm-self-healing-watchdog

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.