sre-operations-lead
Community
Turn noisy alerts into calm, reliable operations.
Author: drewid74
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps you reduce alert fatigue and restore reliability by designing actionable practices for observability, incident response, and capacity planning.
Core Features & Use Cases
- Observability stack strategy: Choose metrics, logs, and traces to answer real reliability questions like “why is it slow?”
- Alert design and SLO-backed governance: Define SLIs/SLOs and craft alerts that correlate with user impact, include runbooks, and reduce flapping.
- Incident operations and continuous improvement: Use an incident workflow (triage → mitigate → investigate → postmortem) and enforce action items.
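To make the SLO-backed alerting idea concrete, here is a minimal sketch of an error-budget burn-rate check. It assumes a request-based availability SLI and borrows the common multi-window convention of paging at a 14.4x burn rate (2% of a 30-day budget spent in one hour). The names `Slo`, `burn_rate`, and `should_page` are hypothetical and are not part of this Skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: target success ratio over a rolling window."""
    target: float          # e.g. 0.999 means 99.9% of requests succeed
    window_days: int = 30  # length of the SLO window (informational here)

    @property
    def error_budget(self) -> float:
        # Fraction of requests allowed to fail within the window.
        return 1.0 - self.target


def burn_rate(slo: Slo, errors: int, total: int) -> float:
    """Observed error ratio divided by the error budget.

    A burn rate of 1.0 spends the budget exactly over the SLO window;
    higher values spend it proportionally faster.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (errors / total) / slo.error_budget


def should_page(slo: Slo, long_window: tuple, short_window: tuple,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold.

    Each window is an (errors, total) pair. The long window shows the burn
    is sustained; the short window shows it is still happening, which
    helps reduce flapping once the problem is mitigated.
    """
    return (burn_rate(slo, *long_window) >= threshold
            and burn_rate(slo, *short_window) >= threshold)


if __name__ == "__main__":
    slo = Slo(target=0.999)
    # Assumed counts: (errors, total) over the last hour and last 5 minutes.
    print(should_page(slo, long_window=(20, 1000), short_window=(5, 100)))
```

Requiring both windows to agree is a design choice: an alert that fires only on sustained, ongoing budget burn correlates more closely with user impact than one that fires on a raw error count.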
Quick Start
Ask an AI to set up SLOs, Prometheus alert rules with runbook annotations, Grafana dashboards provisioned from git, and an incident postmortem template for your service.
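As an illustration of what a request like this might produce, the sketch below builds one Prometheus alerting rule with a runbook annotation and prints it as JSON. The rule fields (`alert`, `expr`, `for`, `labels`, `annotations`) are standard Prometheus rule keys; the metric `http_requests_total`, the job name `checkout`, the alert name, the `runbook_url` annotation key (a convention, not a Prometheus requirement), and the example.com URL are all assumptions chosen for illustration.

```python
import json


def burn_rate_alert(service: str, slo_target: float, runbook_url: str,
                    burn_rate: float = 14.4, window: str = "1h") -> dict:
    """Build a hypothetical burn-rate alerting rule as a plain dict."""
    # Error budget, rounded to avoid float noise in the rendered expression.
    budget = round(1.0 - slo_target, 10)
    # Assumed metric and label names; adjust to what the service exports.
    errors = f'sum(rate(http_requests_total{{job="{service}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{service}"}}[{window}]))'
    return {
        "alert": "HighErrorBudgetBurn",
        "expr": f"{errors} / {total} > {burn_rate} * {budget}",
        "for": "2m",  # require the condition to hold briefly to reduce flapping
        "labels": {"severity": "page"},
        "annotations": {
            "summary": f"{service} is burning its error budget {burn_rate}x too fast",
            "runbook_url": runbook_url,  # links the alert to a mitigation guide
        },
    }


rule_file = {
    "groups": [
        {
            "name": "slo-alerts",
            "rules": [
                burn_rate_alert(
                    service="checkout",  # hypothetical service name
                    slo_target=0.999,
                    runbook_url="https://example.com/runbooks/checkout",  # placeholder
                )
            ],
        }
    ]
}

if __name__ == "__main__":
    print(json.dumps(rule_file, indent=2))
```

Because JSON is a subset of YAML, the printed output should load as a rule file, but it is worth validating any generated rules with `promtool check rules` before committing them to git.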
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill: Name: sre-operations-lead Download link: https://github.com/drewid74/ai_skills/archive/main.zip#sre-operations-lead Please download this .zip file, extract it, and install it in the .claude/skills/ directory.