agent-health-monitoring

Official

Keep multi-agent systems healthy in production

Authorcosmicstack-labs
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Production multi-agent systems can fail silently—agents may become unresponsive, loop, degrade latency, or produce errors without triggering standard infrastructure alerts.

Core Features & Use Cases

  • Agent vital-sign monitoring: Tracks response rate, latency (P50/P95/P99), error rate, step count, tool success rate, token consumption, context pressure, and hallucination/quality proxies.
  • Liveness, readiness, and deep checks: Verifies the agent is alive, ready to accept work, and capable of producing valid outputs via diagnostic test tasks.
  • Anomaly detection and severity-based alerting: Detects statistically unusual behavior and routes incidents by P0–P3 severity to the appropriate channels.
  • Operational dashboards and runbooks: Provides the panel set and incident playbooks needed to investigate and recover quickly (unresponsive agents, error spikes, token budget anomalies).

Quick Start

Ask your agent to “Check agent health for all agents, detect anomalies from the last metrics window, and generate the monitoring dashboard plus the recommended incident response steps for any P0/P1 alerts.”

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: agent-health-monitoring
Download link: https://github.com/cosmicstack-labs/mercury-agent-skills/archive/main.zip#agent-health-monitoring

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.