job-guardian
Community
Keep jobs alive and fail loudly.
Software Engineering
#batch processing #failure triage #job monitoring #checkpoint resume #remote training #progress detection #pod management
Author: btseytlin
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill prevents long-running jobs from silently failing while you’re away, by enforcing evidence-based liveness and progress checks plus safe, reversible recovery.
Core Features & Use Cases
- Launch contract + crash-gating: Forces a written job contract before launch and confirms real health/progress before committing to long polling.
- Stability polling with evidence: Detects hangs by verifying progress changes (e.g., log growth, step/throughput movement) rather than assuming “no error” means “running fine.”
- Recoverable vs. unrecoverable triage: Applies only predefined recovery playbooks with attempt caps, otherwise stops, tears down safely, and notifies.
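As a rough illustration of the "stability polling with evidence" idea, here is a minimal Python sketch that treats log size as the progress signal and flags a hang once that signal stops changing. The names `ProgressTracker`, `log_size`, and the `max_stalled_polls` threshold are hypothetical and are not taken from the Skill itself; the Skill may use different evidence, such as step counters or throughput.

```python
import os
from dataclasses import dataclass


@dataclass
class ProgressTracker:
    """Flags a possible hang when the evidence stops changing (illustrative only)."""
    max_stalled_polls: int = 3      # hypothetical threshold, not from the Skill
    last_evidence: object = None    # evidence seen at the last change
    stalled_polls: int = 0          # consecutive polls with no change

    def observe(self, evidence) -> bool:
        """Record one poll's evidence; return True if the job looks hung."""
        if evidence != self.last_evidence:
            # Evidence moved, so the job made progress since the last change.
            self.last_evidence = evidence
            self.stalled_polls = 0
        else:
            # "No error" is not enough: unchanged evidence counts against the job.
            self.stalled_polls += 1
        return self.stalled_polls >= self.max_stalled_polls


def log_size(path: str) -> int:
    """One possible evidence source: log size in bytes, or -1 if the log is missing."""
    try:
        return os.path.getsize(path)
    except OSError:
        return -1
```

In use, each poll would call something like `tracker.observe(log_size("train.log"))` (the file name is an assumption) and hand off to triage when it returns `True`.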
Quick Start
Give the AI a launch-and-watch request for a batch job, and it will write a guarding contract, start the job, poll every 270s for progress, and either recover or stop-and-notify with the evidence log.
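The watch loop described above might look roughly like the following Python sketch. Only the 270-second interval comes from the text; `watch`, `poll_status`, `classify_failure`, `playbooks`, and `max_attempts` are hypothetical names chosen for illustration, and teardown and notification are left to the caller.

```python
import time
from typing import Callable, Dict

POLL_INTERVAL_S = 270  # polling interval stated in the Quick Start


def watch(
    poll_status: Callable[[], str],           # returns "done", "progressing", or "stalled"
    classify_failure: Callable[[], str],      # names the failure kind, e.g. "oom"
    playbooks: Dict[str, Callable[[], None]], # predefined recovery actions by failure kind
    max_attempts: int = 2,                    # hypothetical per-kind attempt cap
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job finishes or triage decides to stop (illustrative only)."""
    attempts: Dict[str, int] = {}
    evidence_log = []
    while True:
        status = poll_status()
        evidence_log.append(status)
        if status == "done":
            return {"outcome": "done", "evidence": evidence_log}
        if status == "stalled":
            kind = classify_failure()
            playbook = playbooks.get(kind)
            used = attempts.get(kind, 0)
            # Unrecoverable: no predefined playbook, or the attempt cap is spent.
            if playbook is None or used >= max_attempts:
                return {"outcome": "stopped", "reason": kind, "evidence": evidence_log}
            attempts[kind] = used + 1
            evidence_log.append(f"recovery:{kind}:attempt{used + 1}")
            playbook()
        sleep(POLL_INTERVAL_S)
```

A caller would tear the job down and send the notification when the outcome is `"stopped"`, attaching the returned evidence log. Passing `sleep` as a parameter is a design choice that makes the loop testable without waiting 270 seconds per poll.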
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: job-guardian
Download link: https://github.com/btseytlin/ultrapack/archive/main.zip#job-guardian
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.