trino-self-healing-platform

Community

Keep Trino running with autonomous self-healing.

Authorivanshamaev
Version1.0.0
Installs0

System Documentation

What problem does it solve?

It reduces on-call burden by autonomously detecting and recovering from common Trino cluster failures (hung queries, worker loss, OOM-killed queries) and by proactively maintaining query performance and Iceberg health.

Core Features & Use Cases

  • Hung Query Killer: Classifies queries by source (interactive vs batch) and kills those exceeding time thresholds.
  • Memory Pressure Relief: Detects critical memory pressure and terminates low-priority queries to restore stability.
  • Iceberg Auto-Maintenance: Triggers Iceberg compaction and stale statistics ANALYZE based on small-file and staleness signals.
  • RCA Generation: Produces structured incident root-cause analysis using a Claude-based workflow when alerts fire.
  • Use Case: When Prometheus/AlertManager signals elevated latency or query failures, run this skill to kill stuck queries, relieve memory pressure, compact/ANALYZE impacted Iceberg tables, and generate an RCA report for operators.

Quick Start

Ask the agent to enable the Trino self-healing watchdog and configure it to run every 15 minutes against your Trino coordinator, using AlertManager webhooks and Airflow scheduling.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: trino-self-healing-platform
Download link: https://github.com/ivanshamaev/de-agent-skills/archive/main.zip#trino-self-healing-platform

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.