troubleshoot-ml-k8s
Official
Diagnose broken ML inference deployments fast
Category: Software Engineering
Tags: autoscaling, knative, gpu inference, kserve, deployment diagnostics, kubernetes troubleshooting, inferenceservice
Author: wikimedia
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps pinpoint why a Wikimedia KServe/Knative InferenceService deployment on ml-serve or ml-staging is failing, stuck, unreachable, not scaling, or returning bad responses.
Core Features & Use Cases
- Evidence-driven Kubernetes triage: Guides you to gather the smallest set of facts (conditions, events, revisions, pods) to isolate the failing layer.
- Symptom-to-layer branching: Follows a targeted branch for each symptom: a missing resource or rejected spec, revision readiness issues, scheduling or persistence problems, crashes, readiness failures, ingress or routing failures, autoscaling or timeout problems, and bad response payloads.
- Cluster-safe workflow: Instructs you to run exact commands on deployment.eqiad.wmnet after entering the correct cluster context, without broad cluster sweeps.
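As a rough illustration of the evidence-driven triage described above, the sketch below lists the status conditions of an InferenceService that are not `True`, which is usually the first hint about which layer is failing. It assumes the object has been exported as JSON (for example with `kubectl get inferenceservice <name> -n <namespace> -o json`). The function name `failing_conditions` and the sample data are hypothetical and are not taken from the Skill itself.

```python
import json


def failing_conditions(isvc):
    """Return (type, reason, message) for every condition whose status is not "True".

    `isvc` is the parsed JSON of an InferenceService object.
    """
    conditions = isvc.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Hypothetical sample, trimmed to the fields the function reads.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "IngressReady", "status": "True"},
      {"type": "PredictorReady", "status": "False",
       "reason": "RevisionMissing",
       "message": "example message for illustration"},
      {"type": "Ready", "status": "False",
       "reason": "RevisionMissing",
       "message": "example message for illustration"}
    ]
  }
}
""")

for ctype, reason, message in failing_conditions(sample):
    print(f"{ctype}: {reason} ({message})")
```

In this sample the predictor is the failing layer while ingress is healthy, so the next step would be to inspect revisions and pods rather than routing.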
Quick Start
Ask the Skill to diagnose an InferenceService failure, giving it the cluster, namespace, and InferenceService name. Then run the initial diagnostic bundle commands it suggests and paste the outputs back for analysis.
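This listing does not show the actual diagnostic bundle, so the following is only a guess at what such a bundle might contain: a small helper that builds read-only `kubectl` commands scoped to one namespace and one InferenceService. The helper name `diagnostic_bundle`, the example namespace and service name, and the choice of commands are all assumptions. The step of entering the correct cluster context is deliberately left out because the source does not describe it.

```python
import shlex


def diagnostic_bundle(namespace, name):
    """Build namespace-scoped, read-only kubectl commands for one InferenceService.

    The command set is an assumption, not the Skill's actual bundle.
    """
    # Label that KServe sets on the pods of an InferenceService.
    label = f"serving.kserve.io/inferenceservice={name}"
    commands = [
        # Spec, status, and conditions of the InferenceService.
        ["kubectl", "get", "inferenceservice", name, "-n", namespace, "-o", "yaml"],
        ["kubectl", "describe", "inferenceservice", name, "-n", namespace],
        # Knative revisions in the namespace.
        ["kubectl", "get", "revisions", "-n", namespace],
        # Pods that belong to this InferenceService.
        ["kubectl", "get", "pods", "-n", namespace, "-l", label, "-o", "wide"],
        # Recent events, oldest first.
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
    ]
    return [shlex.join(cmd) for cmd in commands]


# Hypothetical namespace and InferenceService name.
for line in diagnostic_bundle("my-namespace", "my-model"):
    print(line)
```

Every command is limited to a single namespace, in keeping with the goal of avoiding broad cluster sweeps.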
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: troubleshoot-ml-k8s Download link: https://github.com/wikimedia/machinelearning-liftwing-inference-services/archive/main.zip#troubleshoot-ml-k8s Please download this .zip file, extract it, and install it in the .claude/skills/ directory.