sre-operations-lead

Community

Turn noisy alerts into calm, reliable operations.

Author: drewid74
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you reduce alert fatigue and restore reliability by designing actionable practices for observability, incident response, and capacity planning.

Core Features & Use Cases

  • Observability stack strategy: Choose metrics, logs, and traces to answer real reliability questions like “why is it slow?”
  • Alert design and SLO-backed governance: Define SLIs/SLOs and craft alerts that correlate with user impact, include runbooks, and reduce flapping.
  • Incident operations and continuous improvement: Use an incident workflow (triage → mitigate → investigate → postmortem) and enforce action items.
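To make the SLO-backed alerting idea concrete, here is a minimal Python sketch of a burn-rate check. It is an illustration under stated assumptions, not this Skill's actual output: the `SLO` class, the `burn_rate` and `should_page` helpers, and the request counts are all hypothetical, and the 14.4 threshold is the commonly cited fast-burn value for a 30-day window (roughly 2% of the error budget consumed in one hour).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SLO:
    """Hypothetical availability SLO, e.g. target=0.999 for 99.9%."""
    target: float
    window_days: int = 30

    @property
    def error_budget(self) -> float:
        # Fraction of requests that may fail without violating the SLO.
        return 1.0 - self.target


def burn_rate(slo: SLO, errors: int, total: int) -> float:
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (errors / total) / slo.error_budget


def should_page(
    slo: SLO,
    short_window: Tuple[int, int],
    long_window: Tuple[int, int],
    threshold: float = 14.4,  # assumed fast-burn threshold
) -> bool:
    """Page only when both windows burn fast, which damps flapping.

    Each window is an (errors, total) pair of request counts.
    """
    return (
        burn_rate(slo, *short_window) >= threshold
        and burn_rate(slo, *long_window) >= threshold
    )


if __name__ == "__main__":
    slo = SLO(target=0.999)
    # Hypothetical counts: (errors, total) for a 5-minute and a 1-hour window.
    print(should_page(slo, short_window=(20, 1000), long_window=(150, 10000)))
```

Requiring both a short and a long window to exceed the threshold is one way to tie a page to sustained user impact rather than a brief spike; the window sizes and threshold should be tuned per service.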

Quick Start

Ask an AI to set up SLOs, Prometheus alert rules with runbook annotations, Grafana dashboards provisioned from git, and an incident postmortem template for your service.
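As a rough idea of what "Prometheus alert rules with runbook annotations" could look like, the sketch below builds one rule group in Python and prints it as JSON, which Prometheus can typically read because JSON is a subset of YAML. The `groups`/`rules`/`alert`/`expr`/`for`/`labels`/`annotations` layout follows the Prometheus rule-file format; everything else is an assumption for illustration: the `alert_rule` and `rule_group` helpers, the `checkout` job, the `http_requests_total` metric and its `code` label, the thresholds, and the runbook URL. The `runbook_url` annotation key is a widespread convention, not something Prometheus requires.

```python
import json
from typing import Dict, Iterable


def alert_rule(name: str, expr: str, for_: str, severity: str,
               summary: str, runbook_url: str) -> Dict:
    """Build one alerting rule that carries a runbook link (hypothetical helper)."""
    return {
        "alert": name,
        "expr": expr,
        "for": for_,
        "labels": {"severity": severity},
        "annotations": {"summary": summary, "runbook_url": runbook_url},
    }


def rule_group(group_name: str, rules: Iterable[Dict]) -> Dict:
    """Wrap rules in the top-level structure of a Prometheus rule file."""
    return {"groups": [{"name": group_name, "rules": list(rules)}]}


# Hypothetical service, metric names, and runbook location.
RULES = rule_group(
    "checkout-slo",
    [
        alert_rule(
            name="CheckoutErrorBudgetFastBurn",
            expr=(
                'sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))'
                ' / sum(rate(http_requests_total{job="checkout"}[1h]))'
                " > 14.4 * 0.001"
            ),
            for_="5m",
            severity="page",
            summary="Checkout is burning its 99.9% error budget too fast",
            runbook_url="https://example.com/runbooks/checkout-fast-burn",
        )
    ],
)

if __name__ == "__main__":
    print(json.dumps(RULES, indent=2))
```

Keeping rules as data in git makes it easy to review that every alert has a runbook before it ships; a generated file can be validated with `promtool check rules` before deployment.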

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: sre-operations-lead
Download link: https://github.com/drewid74/ai_skills/archive/main.zip#sre-operations-lead

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
