triage-alerts

Community

Diagnose alert failures and fix them faster.

Author: david-driscoll
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you rapidly triage operational incidents by correlating live uptime failures from Gatus with active alerting signals from Alertmanager, then translating those patterns into actionable root-cause hypotheses and remediation steps.

Core Features & Use Cases

  • Live failure discovery: Pulls current failing endpoint statuses from Gatus and active alerts from Alertmanager.
  • Root-cause patterning: Categorizes issues into likely causes (e.g., ingress/auth failures, DNS/webhook mismatches, rollout stalls, crash loops, secret sync problems, HelmRelease rollback loops).
  • Cluster-specific verification and remediation: Recommends kubeconfig-appropriate commands to inspect Helm/Kustomize health, pod/event state, and apply targeted fixes.
  • Common remediation playbooks: Includes concrete actions such as breaking stuck HelmRelease conditions, forcing ExternalSecret resync, triggering Flux reconciles, and restarting/rescheduling workloads.
  • Verification loop: Guides polling Gatus until failing endpoints recover to zero.
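The live failure discovery step can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual implementation: the base URLs are hypothetical placeholders, and it assumes Gatus serves endpoint statuses at `/api/v1/endpoints/statuses` (each entry carrying a `results` list ordered oldest to newest, with a `success` flag per result) and that Alertmanager serves alerts at `/api/v2/alerts`. The function names are invented for this example.

```python
import json
from urllib.request import urlopen

# Hypothetical base URLs; substitute the addresses of your own instances.
GATUS_URL = "https://gatus.example.com"
ALERTMANAGER_URL = "https://alertmanager.example.com"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def failing_endpoints(statuses):
    """Return 'group/name' for endpoints whose most recent result failed.

    Assumes the last entry of 'results' is the newest check.
    """
    failing = []
    for endpoint in statuses:
        results = endpoint.get("results") or []
        if results and not results[-1].get("success", False):
            group = endpoint.get("group") or "default"
            failing.append(f"{group}/{endpoint.get('name', '?')}")
    return sorted(failing)


def active_alerts(alerts):
    """Return (alertname, severity) pairs for alerts in the 'active' state."""
    active = []
    for alert in alerts:
        if alert.get("status", {}).get("state") == "active":
            labels = alert.get("labels", {})
            active.append(
                (labels.get("alertname", "?"), labels.get("severity", "unknown"))
            )
    return sorted(active)


def triage_snapshot():
    """Fetch both sources once and return (failing endpoints, active alerts)."""
    statuses = fetch_json(f"{GATUS_URL}/api/v1/endpoints/statuses")
    alerts = fetch_json(f"{ALERTMANAGER_URL}/api/v2/alerts?active=true")
    return failing_endpoints(statuses), active_alerts(alerts)
```

Keeping the parsing functions separate from the HTTP calls means they can be checked against saved JSON payloads before being pointed at a live cluster.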

Quick Start

Run the triage process to fetch Gatus failing endpoints and Alertmanager active alerts, then follow the remediation steps until the failing count reaches 0.
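The "until the failing count reaches 0" part of this flow amounts to a bounded polling loop. The sketch below is one possible shape for it, under assumptions: `count_failing` is a hypothetical callable you supply (for example, one that queries Gatus and counts endpoints whose latest check failed), and the interval and attempt limits are illustrative defaults rather than values taken from the Skill.

```python
import time


def wait_until_recovered(count_failing, interval_seconds=30, max_attempts=20,
                         sleep=time.sleep):
    """Poll until count_failing() returns 0.

    Returns the attempt number on which the count reached zero,
    or None if it never did within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        remaining = count_failing()
        if remaining == 0:
            return attempt
        # Do not sleep after the final attempt; just give up.
        if attempt < max_attempts:
            sleep(interval_seconds)
    return None


if __name__ == "__main__":
    # Simulated failing counts standing in for real Gatus queries.
    simulated = iter([3, 1, 0])
    attempts = wait_until_recovered(lambda: next(simulated), interval_seconds=0)
    print(f"recovered after {attempts} polls")
```

Capping the number of attempts keeps the loop from running indefinitely when a remediation step did not actually fix the underlying problem, at which point it is time to return to diagnosis.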

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: triage-alerts
Download link: https://github.com/david-driscoll/home-operations/archive/main.zip#triage-alerts

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
