
IncidentBrain: Teaching an Agent to Think Like an SRE

IncidentBrain · Open Source · SRE · AI · Incident Management

IncidentBrain is an AI agent that correlates PagerDuty alerts and application events, enriches them with Elasticsearch logs and Grafana metrics, performs root cause analysis, and suggests remediation. Here is how to design an agent that thinks like an SRE.

The SRE Mental Model

When an SRE gets paged, they follow a mental decision tree:

  1. Is this a new incident or a duplicate of something already being investigated?
  2. What changed recently? (deployment, config change, traffic spike)
  3. Which services are affected? (direct and downstream)
  4. What is the probable root cause?
  5. What is the fastest way to mitigate?

IncidentBrain encodes this workflow in code: it correlates signals, enriches them, analyzes the root cause, and suggests remediation.

Signal Normalization

PagerDuty alerts and application events have completely different schemas. IncidentBrain normalizes both into a NormalizedSignal with six fields: signal ID, service name, environment, severity, summary, and timestamp. This normalization is the foundation: everything downstream works on the same data model regardless of source.
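A minimal sketch of what this normalization could look like. The six fields come from the description above. The two input payload shapes, the adapter names from_pagerduty and from_app_event, and the "pd:" / "app:" ID prefixes are assumptions made for illustration, not IncidentBrain's actual schemas or the real PagerDuty payload format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Mapping


@dataclass(frozen=True)
class NormalizedSignal:
    signal_id: str
    service: str
    environment: str
    severity: str
    summary: str
    timestamp: datetime


def from_pagerduty(alert: Mapping[str, Any]) -> NormalizedSignal:
    """Adapter for a hypothetical, simplified PagerDuty-style alert dict."""
    return NormalizedSignal(
        signal_id=f"pd:{alert['id']}",
        service=alert["service"]["name"],
        environment=alert.get("environment", "unknown"),
        severity=str(alert.get("severity", "unknown")).lower(),
        summary=alert["summary"],
        # Accept a trailing "Z" on Python versions older than 3.11.
        timestamp=datetime.fromisoformat(alert["created_at"].replace("Z", "+00:00")),
    )


def from_app_event(event: Mapping[str, Any]) -> NormalizedSignal:
    """Adapter for a hypothetical application event with epoch-second timestamps."""
    return NormalizedSignal(
        signal_id=f"app:{event['event_id']}",
        service=event["svc"],
        environment=event.get("env", "unknown"),
        severity=str(event.get("level", "unknown")).lower(),
        summary=event["message"],
        timestamp=datetime.fromtimestamp(event["ts"], tz=timezone.utc),
    )
```

Keeping one adapter per source means a new source only needs a new adapter function; nothing downstream has to change.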

Fingerprint-Based Correlation

The agent computes a deterministic fingerprint from each signal's service name, environment, and error pattern. Two signals with the same fingerprint are correlated into the same incident. This handles the common case where one issue fires multiple PagerDuty alerts: the SRE sees one incident, not many.
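One way to sketch fingerprint-based correlation is shown here. How IncidentBrain derives the "error pattern" is not specified above, so this example assumes it is the summary text with volatile tokens (numbers, hex-like IDs) masked out. The Signal type, the masking rules, and the truncated SHA-256 digest are all illustrative choices.

```python
import hashlib
import re
from collections import defaultdict
from typing import Iterable, NamedTuple


class Signal(NamedTuple):
    service: str
    environment: str
    summary: str


def error_pattern(summary: str) -> str:
    """Mask volatile tokens so repeated alerts for one issue look identical (assumed rules)."""
    text = summary.lower()
    # Hex-like runs of 8+ characters containing at least one digit (pod hashes, trace IDs).
    text = re.sub(r"\b(?=[0-9a-f]*\d)[0-9a-f]{8,}\b", "<id>", text)
    text = re.sub(r"\d+", "<n>", text)
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(signal: Signal) -> str:
    """Deterministic: the same service, environment, and pattern always hash the same."""
    key = "\x1f".join(
        [signal.service.lower(), signal.environment.lower(), error_pattern(signal.summary)]
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def correlate(signals: Iterable[Signal]) -> dict[str, list[Signal]]:
    """Group signals into incidents keyed by fingerprint."""
    incidents: dict[str, list[Signal]] = defaultdict(list)
    for signal in signals:
        incidents[fingerprint(signal)].append(signal)
    return dict(incidents)
```

With these rules, two alerts that differ only in a latency value or a pod suffix fall into one incident, while the same error in a different environment stays separate.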

The RCA Engine

The RCA engine has two modes: heuristic and LLM-enhanced. The heuristic mode matches signal patterns against known failure templates (OOMKill, connection exhaustion, certificate expiry). The LLM mode sends the enriched incident context to an OpenAI-compatible endpoint and parses the structured response.

Both modes return the same output: root cause summary, confidence score, affected services, timeline, and remediation suggestions. The calling code does not know or care which mode produced the analysis.
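The shared output contract could be sketched as follows. The five result fields come from the description above, and the three template names from the list of known failures. The regular expressions, remediation strings, and confidence formula are placeholders invented for this example. Only the heuristic mode is shown; an LLM-backed analyzer would implement the same analyze method and is omitted here.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Incident:
    services: list[str]
    events: list[str]  # time-ordered alert and log lines


@dataclass
class RCAResult:
    root_cause: str
    confidence: float
    affected_services: list[str]
    timeline: list[str]
    remediation: list[str]


class Analyzer(Protocol):
    def analyze(self, incident: Incident) -> RCAResult: ...


# (summary, pattern, suggestions): all illustrative placeholders.
TEMPLATES = [
    ("OOMKill: container exceeded its memory limit",
     re.compile(r"oomkill|out of memory", re.I),
     ["Review memory limits", "Check for a recent memory regression"]),
    ("Connection exhaustion: pool or server connection limit reached",
     re.compile(r"connection pool exhausted|too many connections", re.I),
     ["Review pool size", "Look for leaked connections"]),
    ("Certificate expiry: TLS certificate no longer valid",
     re.compile(r"certificate (has )?expired", re.I),
     ["Renew the certificate"]),
]


class HeuristicAnalyzer:
    def analyze(self, incident: Incident) -> RCAResult:
        for summary, pattern, suggestions in TEMPLATES:
            hits = [e for e in incident.events if pattern.search(e)]
            if hits:
                # Arbitrary scoring: more matching events, more confidence, capped at 0.9.
                confidence = min(0.9, 0.5 + 0.1 * len(hits))
                return RCAResult(summary, confidence, incident.services, hits, suggestions)
        return RCAResult("unknown", 0.0, incident.services, [], [])


def run_rca(incident: Incident, analyzer: Analyzer) -> RCAResult:
    """The caller depends only on the Analyzer interface, not on the mode behind it."""
    return analyzer.analyze(incident)
```

Because run_rca accepts anything with a matching analyze method, swapping modes is a configuration decision rather than a code change in the caller.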

The Safety Layer

IncidentBrain can suggest remediation but requires explicit opt-in for auto-execution. There are four safety gates: a global toggle, an action toggle, an action allowlist, and dry-run mode. These are independent switches, and all of them default to off.
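A rough sketch of how four independent gates might compose. The field names, the Decision enum, and the decide function are hypothetical. The sketch reads "defaulting to off" for dry-run as "real execution is off", so dry_run defaults to True here. That reading is an assumption, as is treating the action toggle as a single switch covering all remediation actions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SUGGEST_ONLY = "suggest_only"  # show the suggestion, run nothing
    DRY_RUN = "dry_run"            # log what would run, change nothing
    EXECUTE = "execute"            # actually run the action


@dataclass(frozen=True)
class SafetyConfig:
    auto_execute_enabled: bool = False           # global toggle
    actions_enabled: bool = False                # action toggle
    allowed_actions: frozenset[str] = frozenset()  # action allowlist, empty by default
    dry_run: bool = True                         # real execution stays off until disabled


def decide(action: str, config: SafetyConfig) -> Decision:
    """Every gate must be opened explicitly before an action can execute."""
    if not config.auto_execute_enabled:
        return Decision.SUGGEST_ONLY
    if not config.actions_enabled:
        return Decision.SUGGEST_ONLY
    if action not in config.allowed_actions:
        return Decision.SUGGEST_ONLY
    if config.dry_run:
        return Decision.DRY_RUN
    return Decision.EXECUTE
```

With a default SafetyConfig(), decide returns Decision.SUGGEST_ONLY for every action. Reaching Decision.EXECUTE takes four separate, deliberate configuration changes, so a single misconfigured flag cannot trigger a real action.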