lsf_job_fail
Community
Pinpoint LSF job failures from exit and TERM codes
Software Engineering
#oom #lsf #log triage #job failure diagnosis #exit code #term termination #hpc monitoring
Author: liyanqing1987
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This skill helps you diagnose why an LSF, OpenLava, or Volclava job ended in the EXIT state or otherwise failed. It translates exit codes and TERM_* termination reasons into likely root causes and separates application errors from system kills.
Core Features & Use Cases
- Exit Code to Root-Cause Mapping: Interprets common exit code ranges (1–127 for application/script issues, >=128 for signal-driven termination, where the exit code minus 128 usually gives the signal number) to guide the next investigation step.
- TERM_* Termination Reason Diagnosis: Uses TERM_MEMLIMIT, TERM_CPULIMIT, TERM_RUNLIMIT, other TERM_* resource limits, and admin/owner signals to identify what actually stopped the job.
- Actionable Investigation Workflow: Collects job details via bjobs/bhist, extracts required fields (exit code, TERM_*, peak usage vs limits, CWD, command, stdout/stderr), and outputs a structured diagnosis report with immediate remediation steps.
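The exit-code and TERM_* mapping described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names (`classify_exit_code`, `find_term_reason`, `diagnose`) and the hint texts in `TERM_HINTS` are hypothetical, and the code assumes you have already captured the job's `bjobs -l` / `bhist -l` output as plain text.

```python
import re
import signal
from typing import Optional

# Hypothetical hint table for illustration; the skill's own rules may differ.
TERM_HINTS = {
    "TERM_MEMLIMIT": "memory limit reached; compare peak usage with the limit",
    "TERM_CPULIMIT": "CPU time limit reached",
    "TERM_RUNLIMIT": "run time limit reached",
}


def classify_exit_code(code: int) -> str:
    """Map an exit code to the broad ranges named in the feature list."""
    if code == 0:
        return "success"
    if 1 <= code <= 127:
        return "application/script error"
    if code >= 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"signal-driven termination ({name})"
    raise ValueError(f"unexpected exit code: {code}")


def find_term_reason(text: str) -> Optional[str]:
    """Return the first TERM_* token found in the job history text, if any."""
    match = re.search(r"\bTERM_[A-Z_]+\b", text)
    return match.group(0) if match else None


def diagnose(exit_code: int, history_text: str) -> dict:
    """Combine both signals into a small report dictionary."""
    reason = find_term_reason(history_text)
    return {
        "exit_code": exit_code,
        "exit_class": classify_exit_code(exit_code),
        "term_reason": reason,
        # Falls back to a generic note for TERM_* values not in the table.
        "hint": TERM_HINTS.get(reason, "check stdout/stderr for details"),
    }
```

For example, `diagnose(143, "... TERM_RUNLIMIT ...")` would report a signal-driven termination together with the run-limit hint, while an exit code of 1 with no TERM_* token points back at the application's own stdout/stderr. The input string here is illustrative only, not captured from a real cluster.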
Quick Start
Ask: “I have an LSF job failure—what caused my job to EXIT? Please diagnose using the job’s exit code, TERM_* reason, and stdout/stderr logs.”
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill: Name: lsf_job_fail Download link: https://github.com/liyanqing1987/lsfMonitor/archive/main.zip#lsf-job-fail Please download this .zip file, extract it, and install it in the .claude/skills/ directory.