lsf_job_fail


Pinpoint LSF job failures from exit and TERM codes

Author: liyanqing1987
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This skill helps you diagnose why an LSF, OpenLava, or Volclava job ended in EXIT status or otherwise failed. It translates exit codes and TERM_* termination reasons into likely root causes and separates application errors from system kills.

Core Features & Use Cases

  • Exit Code to Root-Cause Mapping: Interprets common exit code ranges (1–127 for application/script issues, >=128 for signal-driven termination) to guide the next investigation step.
  • TERM_* Termination Reason Diagnosis: Uses TERM_MEMLIMIT, TERM_CPULIMIT, TERM_RUNLIMIT, other TERM_* resource-limit reasons, and admin/owner signals to identify what actually stopped the job.
  • Actionable Investigation Workflow: Collects job details via bjobs/bhist, extracts required fields (exit code, TERM_*, peak usage vs limits, CWD, command, stdout/stderr), and outputs a structured diagnosis report with immediate remediation steps.
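
The exit-code ranges listed above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the skill's actual implementation; the function name `classify_exit_code` and the category strings are hypothetical. It assumes the common convention that an exit code of 128 + N means the job was terminated by signal N.

```python
import signal


def classify_exit_code(code: int) -> str:
    """Map a job exit code to a coarse category (hypothetical helper)."""
    if code == 0:
        return "success"
    if 1 <= code <= 127:
        # The application or job script itself returned a non-zero status.
        return "application or script error"
    if 128 <= code <= 255:
        # Assumed convention: exit code = 128 + signal number.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform (or signum == 0).
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "unknown"


print(classify_exit_code(1))    # application or script error
print(classify_exit_code(139))  # terminated by SIGSEGV
print(classify_exit_code(143))  # terminated by SIGTERM
```

A signal-driven code only says how the job died, not who sent the signal, so it is still worth checking the TERM_* reason to tell a scheduler limit apart from a manual kill.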

Quick Start

Ask: “I have an LSF job failure—what caused my job to EXIT? Please diagnose using the job’s exit code, TERM_* reason, and stdout/stderr logs.”
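
If you want to gather the inputs yourself before asking, the sketch below collects the long-format output of `bjobs -l` and `bhist -l` and pulls out the exit code and TERM_* reason. The helper names `collect_job_text` and `extract_fields` are hypothetical, and the regular expressions assume the output contains a phrase like "exit code N" and a literal TERM_* token; real output may wrap long lines, so treat the patterns as a starting point.

```python
import re
import subprocess


def collect_job_text(job_id: str) -> str:
    """Return combined `bjobs -l` and `bhist -l` output for one job."""
    chunks = []
    for cmd in (["bjobs", "-l", job_id], ["bhist", "-l", job_id]):
        # check=False: a missing or expired job should not raise here.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        chunks.append(result.stdout)
    return "\n".join(chunks)


def extract_fields(text: str) -> dict:
    """Extract exit code and TERM_* reason from job text (assumed wording)."""
    exit_match = re.search(r"exit code (\d+)", text, re.IGNORECASE)
    term_match = re.search(r"\bTERM_[A-Z_]+\b", text)
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        "term_reason": term_match.group(0) if term_match else None,
    }


# Illustrative text only, not captured from a real cluster.
sample = "Exited with exit code 137. TERM_MEMLIMIT: job killed after reaching memory limit."
print(extract_fields(sample))
```

On a host with the scheduler commands available, `extract_fields(collect_job_text("12345"))` would run the same extraction against a real job, where `12345` stands in for your job ID.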

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: lsf_job_fail
Download link: https://github.com/liyanqing1987/lsfMonitor/archive/main.zip#lsf-job-fail

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.