aiops-autonomous-incident-response
CommunityDiagnose incidents and start recovery fast.
Software Engineering#kubernetes#incident response#airflow#prometheus#aiops#rca generation#slack approvals
Authorivanshamaev
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill reduces MTTR by autonomously diagnosing data platform incidents from alert signals and assembling a structured draft for RCA and runbook actions.
Core Features & Use Cases
- Alert-to-diagnosis loop: classifies severity, then runs an LLM-driven diagnosis workflow using Kubernetes, Prometheus, Airflow metadata, and pod logs to confirm likely failure modes.
- Automated RCA draft generation: correlates logs/metrics/events into a failure taxonomy and produces an incident report suitable for postmortem follow-up.
- Self-healing with approval gate: executes only low-risk corrective actions automatically and routes medium/high-risk actions through Slack for human approval.
Quick Start
Use the aiops-autonomous-incident-response skill to handle an Alertmanager firing alert for your data platform and produce an incident report with recommended follow-up actions.
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: aiops-autonomous-incident-response Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#aiops-autonomous-incident-response Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.