In most on-call rotations, a significant portion of alerts turns out to be noise. The ratio of actionable to non-actionable alerts is one of the clearest indicators of operational health. When that ratio is poor, adding more suppression rules is not the answer.
The Problem with Rules
PagerDuty suppression rules are static: "Suppress alerts from service X during certain hours" or "Don't page for warning severity." But real incidents do not follow rules. A warning in the middle of the night that correlates with other warnings is an incident. A critical alert during a known maintenance window is noise.
As suppression rules accumulate, teams spend more time maintaining the rules than investigating the alerts they suppress.
What Teams Actually Need
A system that can:

1. Correlate multiple alerts into a single incident (related alerts produce one page, not many)
2. Enrich alerts with context (recent deployments, ongoing incidents, known issues; sketched in code after this list)
3. Assess severity based on impact, not just threshold crossings
4. Route to the right person based on the affected service, not just a rotation schedule
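To make capability 2 concrete, here is a minimal Python sketch of enrichment. The `Alert` shape and the `deployments` feed are illustrative assumptions, not any particular tool's API; in practice the feed would come from your deployment pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Alert:
    service: str
    severity: str
    message: str
    fired_at: datetime
    context: dict = field(default_factory=dict)

def enrich(alert: Alert, deployments: list[dict]) -> Alert:
    # Attach deployments for the alerted service from the last two hours.
    # `deployments` is a hypothetical feed of {"service": ..., "deployed_at": datetime}
    # records; in practice it would come from your CI/CD system's API.
    window_start = alert.fired_at - timedelta(hours=2)
    alert.context["recent_deployments"] = [
        d for d in deployments
        if d["service"] == alert.service and d["deployed_at"] >= window_start
    ]
    return alert
```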
This kind of intelligent alert management is what drives the design of agent-based incident response tools. A heuristic version groups alerts by service and time window, correlates with recent deployments, and suppresses duplicates. An LLM-enhanced version can reason about whether a pattern of alerts represents a new incident or a known issue.
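The heuristic version can be sketched in a few lines, building on the `Alert` class above. The five-minute window and message-based deduplication are illustrative choices, not a reference implementation; deployment correlation is handled by the `enrich` step shown earlier.

```python
from collections import defaultdict
from datetime import timedelta

def correlate(alerts: list[Alert],
              window: timedelta = timedelta(minutes=5)) -> list[list[Alert]]:
    # Bucket alerts by service, in time order.
    by_service: dict[str, list[Alert]] = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        by_service[alert.service].append(alert)

    incidents: list[list[Alert]] = []
    for service_alerts in by_service.values():
        group: list[Alert] = []
        seen: set[str] = set()
        for alert in service_alerts:
            # A gap longer than `window` starts a new candidate incident.
            if group and alert.fired_at - group[-1].fired_at > window:
                incidents.append(group)
                group, seen = [], set()
            # Suppress duplicates: identical messages page once per group.
            if alert.message not in seen:
                group.append(alert)
                seen.add(alert.message)
        if group:
            incidents.append(group)
    return incidents
```

Each returned group is a candidate incident: one page instead of one per alert.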
The Human Factor
The biggest improvement is often cultural, not technical. A useful practice: every alert that pages someone but requires no action gets a "noise" tag. Monthly, the team reviews all noise-tagged alerts and either fixes the alert threshold, adds context to the runbook, or removes the alert entirely. This feedback loop steadily reduces alert fatigue.
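The review itself needs little tooling beyond the tag. As one illustration, assuming the tag lives in the `Alert` context from the sketches above, the monthly report can be a short query:

```python
from collections import Counter

def noise_report(alerts: list[Alert]) -> list[tuple[str, int]]:
    # Rank noise-tagged alerts so the review starts with the worst offenders.
    # Assumes reviewers add a "noise" tag to alert.context["tags"].
    counts = Counter(
        a.message for a in alerts if "noise" in a.context.get("tags", [])
    )
    return counts.most_common()
```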