dataops-self-healing-platform
CommunityAutomatically heal broken data pipelines
Authorivanshamaev
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill reduces on-call workload by automatically recovering from common data platform failures before they escalate into prolonged data downtime or missing data.
Core Features & Use Cases
- Self-healing Airflow orchestration: Auto-restart failed DAG runs with guardrails for idempotency and a circuit breaker to prevent retry storms.
- Kafka consumer auto-recovery: Detect consumer lag/partition rebalance issues and apply automated actions such as scale-up, offset reset, or alerting.
- Data reliability protections: Auto-quarantine partitions that fail DQ gates, detect missing partitions, and trigger automated backfill; include watchdog heartbeat monitoring and OOM-based Kubernetes scaling.
Quick Start
Ask an agent to set up the self_healing_watchdog workflow to auto-restart eligible failed Airflow DAG runs and quarantine partitions that fail DQ checks while monitoring Kafka lag and pipeline heartbeats.
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: dataops-self-healing-platform Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#dataops-self-healing-platform Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.