Name: troubleshoot-ml-k8s
Author: wikimedia

System Documentation

What problem does it solve?

This Skill helps pinpoint why a Wikimedia KServe/Knative InferenceService deployment on ml-serve or ml-staging is failing, stuck, unreachable, not scaling, or returning bad responses.

Core Features & Use Cases

Evidence-driven Kubernetes triage: Guides you to gather the smallest set of facts (conditions, events, revisions, pods) to isolate the failing layer.
Symptom-to-layer branching: Follows a targeted branch for each symptom: a missing InferenceService or rejected spec, revisions that never become ready, scheduling or persistence problems, crashes, readiness failures, ingress or routing failures, autoscaling problems or timeouts, and bad response payloads.
Cluster-safe workflow: Instructs you to run exact commands on deployment.eqiad.wmnet after entering the correct cluster context, without broad cluster sweeps.
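
As an illustration of the symptom-to-layer idea, the sketch below maps each symptom category named above to the kinds of evidence (conditions, events, revisions, pods) one might inspect first. The key names, the specific pairings, and the function name evidence_for are assumptions made for this example; they are not taken from the Skill itself.

```python
# Hypothetical mapping from symptom category to the evidence to gather first.
# The categories mirror the feature list; the pairings are illustrative guesses.
SYMPTOM_TO_EVIDENCE = {
    "missing_or_spec_rejected": ["conditions", "events"],
    "revision_not_ready": ["revisions", "conditions"],
    "scheduling_or_persistence": ["events", "pods"],
    "crash": ["pods", "events"],
    "readiness_failure": ["pods", "conditions"],
    "ingress_or_routing": ["conditions", "revisions"],
    "autoscaling_or_timeout": ["revisions", "pods"],
    "bad_response_payload": ["pods"],
}


def evidence_for(symptom):
    """Return the evidence types to collect first for a symptom category."""
    try:
        return SYMPTOM_TO_EVIDENCE[symptom]
    except KeyError:
        known = ", ".join(sorted(SYMPTOM_TO_EVIDENCE))
        raise ValueError(f"unknown symptom {symptom!r}; expected one of: {known}")


if __name__ == "__main__":
    print(evidence_for("crash"))
```

Keeping the mapping as plain data makes it easy to see that every branch starts from a small, fixed set of facts rather than a broad cluster sweep.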

Quick Start

Ask the Skill to diagnose an InferenceService failure, giving it the cluster, namespace, and InferenceService name. Then run the initial diagnostic bundle commands it returns and paste the outputs back for analysis.
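
The Skill's actual diagnostic bundle is not reproduced on this page. As a rough sketch of what such a bundle could look like, the helper below assembles a handful of standard read-only kubectl commands scoped to one namespace and one InferenceService. The function name diagnostic_bundle, the choice of commands, and the example values my-namespace and my-model are assumptions for illustration; the sketch also assumes the correct cluster context has already been entered on the deployment host.

```python
import shlex


def diagnostic_bundle(namespace, name):
    """Build read-only kubectl commands for one InferenceService (illustrative)."""
    base = ["kubectl", "-n", namespace]
    return [
        # Full object, including status conditions
        base + ["get", "inferenceservice", name, "-o", "yaml"],
        # Human-readable summary with recent events for the object
        base + ["describe", "inferenceservice", name],
        # Knative revisions in the namespace
        base + ["get", "revisions.serving.knative.dev"],
        # Pods that KServe labels as belonging to this InferenceService
        base + ["get", "pods", "-l", f"serving.kserve.io/inferenceservice={name}"],
        # Namespace events, oldest first
        base + ["get", "events", "--sort-by=.lastTimestamp"],
    ]


if __name__ == "__main__":
    # Hypothetical namespace and service name
    for cmd in diagnostic_bundle("my-namespace", "my-model"):
        print(shlex.join(cmd))
```

Every command is namespaced and read-only, which matches the stated goal of avoiding broad cluster sweeps.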

Please help me install this Skill:
Name: troubleshoot-ml-k8s
Download link: https://github.com/wikimedia/machinelearning-liftwing-inference-services/archive/main.zip#troubleshoot-ml-k8s
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
