troubleshoot-ceph
OfficialDiagnose Ceph incidents fast with Netdata
System Documentation
What problem does it solve?
Troubleshoot Ceph failures by turning Netdata MCP health signals into a structured diagnosis path for common incident archetypes like capacity exhaustion, stuck placement groups, slow requests, MON quorum loss, and OSD flapping.
Core Features & Use Cases
- Operator-playbook-aligned triage: Guides an agent through a deterministic diagnostic tree matching Netdata’s Ceph playbook, so investigation stays consistent during on-call incidents.
- MCP-driven signal verification: Queries Netdata via MCP for Ceph health signals and validates remediation by re-running the same verification queries.
- Failure-mode coverage: Handles multiple recurring Ceph outage patterns and correlates them with host-level resource signals to avoid common misdiagnoses.
Use case example: An on-call engineer receives an alert that a Ceph cluster is experiencing elevated error rates and latency; use this skill to identify whether the root cause is nearfull throttling, stuck PGs from failed OSDs, MON quorum loss, request slowness, or OSD flapping, then recommend the smallest effective remediation and confirm the fix holds.
Quick Start
Tell the AI: "Diagnose my Ceph incident using Netdata MCP; check capacity exhaustion, stuck PGs, slow requests, MON quorum loss, and OSD flapping, then verify remediation by re-running the Ceph health queries."
Dependency Matrix
Required Modules
None requiredComponents
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: troubleshoot-ceph Download link: https://github.com/netdata/skills/archive/main.zip#troubleshoot-ceph Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.