babysit-training

Community

Autonomously monitor ML training runs.

AuthorAlbatross679
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Monitor and safeguard ongoing ML training jobs by performing comprehensive health checks and auto-restart from the last checkpoint on crash. Document any issues found in issues/ with proper frontmatter.

Core Features & Use Cases

  • Continuous health checks: process alive, log progression, GPU utilization, disk space, checkpoint integrity, and W&B connectivity.
  • Auto-restart from last checkpoint on crash and automatic issue documentation in issues/ with frontmatter.
  • Periodic, loop-based monitoring for uninterrupted training visibility and rapid incident response.

Quick Start

Start monitoring by initiating the babysit-training loop to continuously observe training status and auto-recover when needed.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: babysit-training
Download link: https://github.com/Albatross679/0313temp/archive/main.zip#babysit-training

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.