babysit-training
CommunityAutonomously monitor ML training runs.
AuthorAlbatross679
Version1.0.0
Installs0
System Documentation
What problem does it solve?
Monitor and safeguard ongoing ML training jobs by performing comprehensive health checks and auto-restart from the last checkpoint on crash. Document any issues found in issues/ with proper frontmatter.
Core Features & Use Cases
- Continuous health checks: process alive, log progression, GPU utilization, disk space, checkpoint integrity, and W&B connectivity.
- Auto-restart from last checkpoint on crash and automatic issue documentation in issues/ with frontmatter.
- Periodic, loop-based monitoring for uninterrupted training visibility and rapid incident response.
Quick Start
Start monitoring by initiating the babysit-training loop to continuously observe training status and auto-recover when needed.
Dependency Matrix
Required Modules
None requiredComponents
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: babysit-training Download link: https://github.com/Albatross679/0313temp/archive/main.zip#babysit-training Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.