offline-GPU-fault-diagnosis

Community

Diagnose offline GPU faults from logs.

Author: linfordWu
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

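Diagnoses GPU hardware faults, driver anomalies, GPU memory ECC errors, and PCIe link problems offline by analyzing server logs (iBMC, OS Messages, InfoCollect). Invoke this skill when the user supplies logs and asks for root-cause analysis of a GPU dropping off the bus ("Fallen off the bus"), XID errors, uncorrectable GPU memory errors (Uncorrectable ECC), GPU-level overheating, or performance degradation.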

Core Features & Use Cases

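  • Aggregates and analyzes logs from multiple sources (iBMC, system logs, InfoCollect) to locate the root cause and produce verifiable inferences.
  • Supports root-cause inference and evidence-chain construction for GPU drop-off, XID, ECC, and over-temperature/power scenarios, and gives repair recommendations.
  • Provides reproducible case-level examples and step-by-step troubleshooting paths so that operations staff can apply offline diagnosis quickly.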
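
As a rough illustration of the signature matching that this kind of log analysis relies on, the sketch below counts GPU fault signatures in a list of log lines. It is not the skill's actual implementation: the `SIGNATURES` table, `classify_line`, and `summarize` are hypothetical names, and the patterns are assumptions modeled on the fault types named above (XID, "fallen off the bus", Uncorrectable ECC).

```python
import re
from collections import Counter

# Assumed signature table; the skill's real rule set is not shown in this document.
SIGNATURES = {
    # NVIDIA driver Xid messages, e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..."
    "xid": re.compile(r"NVRM: Xid \((?P<bus>[^)]+)\): (?P<code>\d+)"),
    # GPU dropped off the PCIe bus
    "fallen_off_bus": re.compile(r"fallen off the bus", re.IGNORECASE),
    # Uncorrectable GPU memory errors
    "uncorrectable_ecc": re.compile(r"uncorrectable ecc", re.IGNORECASE),
}


def classify_line(line):
    """Return the names of all signatures that match one log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


def summarize(lines):
    """Count how often each signature appears across the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify_line(line))
    return counts
```

A single line can match more than one signature (an Xid message that also reports a bus drop-off, for example), which is why `classify_line` returns a list. A real diagnosis would also need to correlate timestamps across the iBMC, OS, and InfoCollect sources, which this sketch leaves out.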

Quick Start

Provide a logs directory with ibmc_logs, messages, and infocollect_logs, and run the diagnosis to obtain a structured root-cause report.
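
Before running a diagnosis, it can help to confirm that the logs directory contains the three inputs named above. The following is a minimal sketch, not part of the skill itself: `missing_sources` is a hypothetical helper, and it assumes only that each input exists as an entry (file or directory) directly under the logs directory.

```python
from pathlib import Path

# Input names taken from the Quick Start text; whether each one is a file
# or a directory is not specified there, so only existence is checked.
EXPECTED_SOURCES = ("ibmc_logs", "messages", "infocollect_logs")


def missing_sources(logs_dir):
    """Return the expected log sources that are absent from logs_dir."""
    root = Path(logs_dir)
    return [name for name in EXPECTED_SOURCES if not (root / name).exists()]
```

An empty return value means all three sources are present; any names returned indicate which evidence will be unavailable to the analysis.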

Dependency Matrix

Required Modules

None required

Components

scripts, references

💻 Claude Code Installation

Recommended: let Claude install the skill automatically by pasting the text below into Claude Code.

Please help me install this Skill:
Name: offline-GPU-fault-diagnosis
Download link: https://github.com/linfordWu/owls/archive/main.zip#offline-gpu-fault-diagnosis

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.