agent-health-monitoring
OfficialKeep multi-agent systems healthy in production
Software Engineering#multi-agent#observability#incident response#alerting#anomaly detection#agent monitoring#latency metrics
Authorcosmicstack-labs
Version1.0.0
Installs0
System Documentation
What problem does it solve?
Production multi-agent systems can fail silently—agents may become unresponsive, loop, degrade latency, or produce errors without triggering standard infrastructure alerts.
Core Features & Use Cases
- Agent vital-sign monitoring: Tracks response rate, latency (P50/P95/P99), error rate, step count, tool success rate, token consumption, context pressure, and hallucination/quality proxies.
- Liveness, readiness, and deep checks: Verifies the agent is alive, ready to accept work, and capable of producing valid outputs via diagnostic test tasks.
- Anomaly detection and severity-based alerting: Detects statistically unusual behavior and routes incidents by P0–P3 severity to the appropriate channels.
- Operational dashboards and runbooks: Provides the panel set and incident playbooks needed to investigate and recover quickly (unresponsive agents, error spikes, token budget anomalies).
Quick Start
Ask your agent to “Check agent health for all agents, detect anomalies from the last metrics window, and generate the monitoring dashboard plus the recommended incident response steps for any P0/P1 alerts.”
Dependency Matrix
Required Modules
None requiredComponents
Standard package💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: agent-health-monitoring Download link: https://github.com/cosmicstack-labs/mercury-agent-skills/archive/main.zip#agent-health-monitoring Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.