troubleshoot-ml-k8s

Official

Diagnose broken ML inference deployments fast

Author: wikimedia
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps pinpoint why a Wikimedia KServe/Knative InferenceService deployment on ml-serve or ml-staging is failing, stuck, unreachable, not scaling, or returning bad responses.

Core Features & Use Cases

  • Evidence-driven Kubernetes triage: Guides you to gather the smallest set of facts (conditions, events, revisions, pods) to isolate the failing layer.
  • Symptom-to-layer branching: Routes each symptom to a targeted branch: a missing resource or rejected spec, revision readiness issues, scheduling/persistence problems, crashes, readiness failures, ingress/routing failures, autoscaling/timeouts, and bad response payloads.
  • Cluster-safe workflow: Instructs you to run exact commands on deployment.eqiad.wmnet after entering the correct cluster context, without broad cluster sweeps.
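The symptom-to-layer branching described above can be pictured as a simple lookup. The sketch below is illustrative only: the branch names come from the feature list, but the key names, the "first evidence" pairing for each branch, and the `pick_branch` helper are assumptions made for this example and are not taken from the Skill's source.

```python
# Hypothetical sketch of symptom-to-layer branching.
# Branch names follow the feature list; the evidence paired with each
# branch is an assumption, not the Skill's actual logic.
BRANCHES = {
    "missing_or_spec_rejected": "InferenceService conditions and events",
    "revision_not_ready": "Knative revisions and their conditions",
    "scheduling_or_persistence": "pod events (Pending pods, volume mounts)",
    "crashes": "pod status and container logs",
    "readiness_failures": "pod conditions and probe events",
    "ingress_or_routing": "routes and InferenceService URL status",
    "autoscaling_or_timeouts": "revision scaling state and pod counts",
    "bad_response_payloads": "model container logs and a sample request",
}

# Fallback when the symptom is not recognised: gather the minimal facts first.
DEFAULT = "conditions, events, revisions, pods"


def pick_branch(symptom: str) -> str:
    """Return the evidence to inspect first for a given symptom key."""
    return BRANCHES.get(symptom, DEFAULT)


if __name__ == "__main__":
    for key in ("crashes", "something_else"):
        print(f"{key}: start with {pick_branch(key)}")
```

Keeping the branches in one table makes the fallback explicit: when a symptom does not match any branch, the safest next step is the small evidence set named in the first feature.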

Quick Start

Ask the Skill to diagnose an InferenceService failure by giving it the cluster (ml-serve or ml-staging), the namespace, and the InferenceService name. Then run the initial diagnostic bundle commands and paste the outputs back for analysis.
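To give a feel for what an initial diagnostic bundle might contain, here is a minimal sketch that only builds and prints read-only `kubectl` commands scoped to one namespace. The exact commands the Skill issues are not shown in this listing, so this particular selection is an assumption; `example-ns` and `example-model` are placeholder names, and the function `diagnostic_bundle` is hypothetical. The sketch also assumes you have already entered the correct cluster context on deployment.eqiad.wmnet.

```python
import shlex


def diagnostic_bundle(namespace: str, name: str) -> list[list[str]]:
    """Build read-only kubectl commands for one InferenceService.

    Assumed selection of commands; nothing is executed here, and every
    command is scoped to a single namespace (no cluster-wide sweeps).
    """
    base = ["kubectl", "-n", namespace]
    # Label that KServe sets on the pods of an InferenceService.
    selector = f"serving.kserve.io/inferenceservice={name}"
    return [
        base + ["get", "inferenceservice", name, "-o", "yaml"],   # conditions
        base + ["describe", "inferenceservice", name],            # events
        base + ["get", "revisions.serving.knative.dev"],          # revisions
        base + ["get", "pods", "-l", selector, "-o", "wide"],     # pods
        base + ["get", "events", "--sort-by=.lastTimestamp"],     # recent events
    ]


if __name__ == "__main__":
    # Placeholder namespace and service name, for illustration only.
    for cmd in diagnostic_bundle("example-ns", "example-model"):
        print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try anywhere; copy the printed lines into your shell on the deployment host and paste the outputs back to the Skill.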

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: troubleshoot-ml-k8s
Download link: https://github.com/wikimedia/machinelearning-liftwing-inference-services/archive/main.zip#troubleshoot-ml-k8s

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.