
The Case for AI-Driven Root Cause Analysis in Platform Engineering

AI · RCA · Platform Engineering · LLM · Incident Management

Root cause analysis is the most time-consuming part of incident response. An experienced SRE can look at metrics, logs, and deployment history and identify the root cause relatively quickly. A junior engineer takes much longer. An AI agent can do the same analysis in seconds, with important caveats.

What AI RCA Can Do Today

Given structured inputs (metrics, logs, deployment timeline), an LLM can:

  1. Correlate symptoms. "Memory usage spiked at the same time a new deployment rolled out" is a pattern an LLM identifies instantly.
  2. Suggest probable causes. Given the symptoms, the LLM generates a ranked list of likely causes based on the patterns it has seen in training data.
  3. Recommend remediation. For common failure modes (OOMKill, connection pool exhaustion, certificate expiry), the remediation is well-known. The LLM can suggest the right fix.
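As a minimal sketch of what "structured inputs" and symptom correlation might look like, here is a toy incident payload and a time-window heuristic. The field names, payload shape, and 10-minute window are illustrative assumptions, not a real API:

```python
from datetime import datetime, timedelta

# Hypothetical incident payload: the structured inputs described above.
incident = {
    "metric_events": [
        {"name": "memory_usage_pct", "value": 94, "at": datetime(2024, 5, 1, 14, 7)},
    ],
    "deployments": [
        {"service": "checkout", "version": "v2.3.1", "at": datetime(2024, 5, 1, 14, 5)},
    ],
}

def correlate(incident, window=timedelta(minutes=10)):
    """Flag metric events that occur shortly after a deployment."""
    findings = []
    for m in incident["metric_events"]:
        for d in incident["deployments"]:
            delta = m["at"] - d["at"]
            if timedelta(0) <= delta <= window:
                findings.append(
                    f"{m['name']} spiked {delta.seconds // 60} min after "
                    f"deploy of {d['service']} {d['version']}"
                )
    return findings

print(correlate(incident))
```

Findings like these become part of the prompt; the LLM then does the ranking and remediation steps on top of them.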

What AI RCA Cannot Do Today

  1. Novel failures. If the root cause is a kernel bug, a hardware failure, or a misconfigured CDN, the LLM will not identify it from application metrics alone.
  2. Business context. The LLM does not know that the traffic spike was caused by a marketing campaign, or that the database is slow because a batch job runs at a certain time.
  3. Certainty. Every LLM response is probabilistic. Even a high self-reported confidence leaves a real chance the analysis is wrong, and for incident response that residual uncertainty matters.

The Hybrid Approach

The most effective design is a hybrid: deterministic heuristics handle known failure patterns (OOMKill, crash loops, connection exhaustion) with high confidence. The LLM handles ambiguous cases where heuristics cannot determine the cause. If the LLM's confidence is below the threshold, it escalates to a human.

This hybrid approach means the agent handles the majority of incidents autonomously and escalates the rest with context that reduces human investigation time significantly.