After years of watching Kubernetes incidents, the pattern becomes clear: an alert fires, an engineer wakes up, reads the logs, identifies the problem, and runs the same kubectl command they ran last time. KubeHealer automates this entire loop.
The Architecture
KubeHealer is a Spring Boot service that runs inside a Kubernetes cluster. It polls Prometheus for metric anomalies and Elasticsearch for error patterns at a regular interval. When it detects something wrong, such as a pod being OOMKilled, a deployment stuck in rollback, or a latency spike, it opens an incident.
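To make the detection step concrete, here is a minimal sketch of the threshold check that might sit behind the polling loop. The `MetricSample` record, field names, and threshold values are all illustrative assumptions, not KubeHealer's actual API; a real implementation would populate the samples from Prometheus query results.

```java
import java.util.List;

// Hypothetical sketch of the anomaly check behind the polling loop.
// MetricSample and both thresholds are illustrative, not KubeHealer's real schema.
public class AnomalyDetector {
    /** One scraped data point per pod: OOMKill count and p99 latency. */
    record MetricSample(String pod, long oomKills, double p99LatencyMs) {}

    static final long OOM_KILL_THRESHOLD = 1;       // any OOMKill is anomalous
    static final double LATENCY_THRESHOLD_MS = 500; // illustrative SLO ceiling

    /** Returns the names of pods whose samples cross either threshold. */
    static List<String> detect(List<MetricSample> samples) {
        return samples.stream()
                .filter(s -> s.oomKills() >= OOM_KILL_THRESHOLD
                          || s.p99LatencyMs() > LATENCY_THRESHOLD_MS)
                .map(MetricSample::pod)
                .toList();
    }

    public static void main(String[] args) {
        var samples = List.of(
                new MetricSample("api-7f9c", 0, 120.0),
                new MetricSample("worker-2b1a", 2, 95.0),
                new MetricSample("gateway-55d0", 0, 850.0));
        System.out.println(detect(samples)); // [worker-2b1a, gateway-55d0]
    }
}
```

In a Spring Boot service, a method like `detect` would typically be invoked from a `@Scheduled` component so the poll interval stays in configuration rather than code.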
The incident goes through a pipeline: signal collection, correlation (grouping related signals), historical lookup (has this happened before?), LLM reasoning (what is the root cause and what should the remediation be?), and finally, guarded remediation.
The Safety System
Remediation is the part of the pipeline that requires the most care. An agent that can scale deployments and restart pods can also cause serious cluster problems. KubeHealer's safety system has multiple layers: namespace allowlists, action allowlists, confidence thresholds, and scaling ceilings. Every action is logged with full context. The default mode is dry-run, meaning the agent reports what it would do without doing it.
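The layered guard described above can be sketched as a single evaluation function. The field names, allowlist contents, and thresholds below are illustrative assumptions, not KubeHealer's real configuration schema; the only behavior taken from the text is the ordering of checks and the dry-run default.

```java
import java.util.Set;

// Hedged sketch of the layered remediation guard. Allowlists and
// thresholds here are made-up examples, not KubeHealer's actual config.
public class RemediationGuard {
    record ProposedAction(String namespace, String verb, double confidence, int targetReplicas) {}

    static final Set<String> NAMESPACE_ALLOWLIST = Set.of("staging", "batch-jobs");
    static final Set<String> ACTION_ALLOWLIST = Set.of("restart-pod", "scale-deployment");
    static final double MIN_CONFIDENCE = 0.85;
    static final int MAX_REPLICAS = 20;  // scaling ceiling
    static final boolean DRY_RUN = true; // default mode per the text

    /** Returns a verdict string; only "EXECUTE" would actually mutate the cluster. */
    static String evaluate(ProposedAction a) {
        if (!NAMESPACE_ALLOWLIST.contains(a.namespace())) return "BLOCKED: namespace not allowlisted";
        if (!ACTION_ALLOWLIST.contains(a.verb()))         return "BLOCKED: action not allowlisted";
        if (a.confidence() < MIN_CONFIDENCE)              return "BLOCKED: confidence below threshold";
        if (a.targetReplicas() > MAX_REPLICAS)            return "BLOCKED: exceeds scaling ceiling";
        return DRY_RUN ? "DRY-RUN: would " + a.verb() + " in " + a.namespace() : "EXECUTE";
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new ProposedAction("staging", "scale-deployment", 0.93, 6)));
        // DRY-RUN: would scale-deployment in staging
        System.out.println(evaluate(new ProposedAction("kube-system", "restart-pod", 0.99, 1)));
        // BLOCKED: namespace not allowlisted
    }
}
```

Note that the checks fail closed: an action must pass every layer, and even then the dry-run default means nothing mutates the cluster until an operator opts in.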
Why Java, Not Python
Most AI agent frameworks are written in Python. Java was chosen for three reasons: (1) the Fabric8 Kubernetes client is a mature, full-featured K8s SDK, (2) Spring Boot's dependency injection keeps the service layers clean and testable, and (3) the platform engineering teams KubeHealer targets already run Java services. A Java agent fits into their ecosystem without adding a Python runtime to their Docker images.
What's Next
Richer LLM integration for complex reasoning, Terraform-based infrastructure remediation (not just pod-level actions), and a decision replay system that lets teams review the agent's reasoning after the fact.