talos-cluster-ops

Official

Diagnose and fix Talos Kubernetes issues fast

AuthorProbably-Group
Version1.0.0
Installs0

System Documentation

What problem does it solve?

It resolves slow, broken, or unstable Talos-based Kubernetes clusters by guiding investigation across hardware, Talos, Kubernetes, Cilium networking, Longhorn storage, GitOps drift, and security signals, then proposing remediation that requires explicit approval.

Core Features & Use Cases

  • Autonomous investigation workflows for pod crashes, node failures, networking symptoms, storage problems, and GitOps sync failures with structured, multi-phase evidence gathering.
  • Approval-gated remediation with explicit risk levels, rollback expectations, and optional snapshot requirements to reduce operational mistakes.
  • Cross-layer correlation using Talos health/logs, Kubernetes events/logs/metrics, Cilium/Hubble flow evidence, Longhorn replica health, ArgoCD drift checks, and security signals (Falco/Tetragon/Trivy/SPIRE).

Quick Start

Tell your AI to investigate a failing workload with: "Investigate why my pod rabbitmq-0 in production is crashing using talos-cluster-ops and propose the safest remediation I can approve."

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: talos-cluster-ops
Download link: https://github.com/Probably-Group/Dev-AID/archive/main.zip#talos-cluster-ops

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.