2025-07-24

Flaky Tests Are a Culture Problem, Not a Code Problem

Testing, CI/CD, Engineering Culture, Quality

Flaky tests that have been known about for months but never fixed are a symptom of a culture problem, not a code problem. When the pipeline has a retry mechanism, the retry hides the flakiness, and the flakiness hides real bugs.

How Flaky Tests Kill Quality

A flaky test fails only a small percentage of the time, so with retries enabled it is rare for every attempt on the same test to fail. That feels safe. But with many builds per day, it still happens multiple times a week. Engineers learn to "just re-run the pipeline," and test failures become acceptable.
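The arithmetic behind "rare per build, frequent per week" can be sketched as follows. Every number here is an illustrative assumption rather than a measurement (5% per-attempt failure rate, 3 attempts, 40 flaky tests, 100 builds per day), and the attempts are assumed to fail independently, which real flaky tests may not do.

```python
def p_retries_exhausted(p_fail, attempts):
    """Probability that one flaky test fails on every attempt.

    Assumes attempts are independent, which is a simplification.
    """
    return p_fail ** attempts


def expected_red_builds_per_week(p_fail, attempts, flaky_tests,
                                 builds_per_day, days=5):
    """Expected builds per week where at least one flaky test exhausts its retries."""
    p_test = p_retries_exhausted(p_fail, attempts)
    # A build goes red if any one of the flaky tests fails all its attempts.
    p_build = 1 - (1 - p_test) ** flaky_tests
    return p_build * builds_per_day * days


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the scale effect.
    per_test = p_retries_exhausted(p_fail=0.05, attempts=3)
    per_week = expected_red_builds_per_week(
        p_fail=0.05, attempts=3, flaky_tests=40, builds_per_day=100
    )
    print(f"one test exhausts retries: {per_test:.6f}")
    print(f"expected red builds per week: {per_week:.2f}")
```

Under these assumed inputs a single test exhausts its retries about once in 8,000 builds, yet the team still sees roughly 2.5 spurious red builds per working week.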

Then a real bug introduces a real test failure. The engineer sees a red build, assumes it is the usual flaky test, and re-runs the pipeline without checking which test failed. The failure was actually in a different test, but if the bug is itself intermittent (a race condition, for example), the second run passes and nobody looks further. The bug ships to production.

The Fix

Eliminate retries. If a test fails, the pipeline fails. Period.

The first week is painful. Flaky tests fail intermittently and block merges. Engineers are frustrated. But within weeks, all the flaky tests get fixed. Some have timing dependencies (fixed waits replaced with Awaitility polling). Some share mutable state (fixed with test isolation). Some are actually testing the wrong thing (deleted).
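The timing fix amounts to polling for a condition with a deadline instead of sleeping for a fixed interval, which is what Awaitility offers on the JVM. Below is a minimal Python sketch of the same idea. `wait_until` is a hypothetical helper written for this example, not Awaitility's API, and the background job is a stand-in.

```python
import threading
import time


def wait_until(condition, timeout=2.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        time.sleep(interval)


def test_background_job_completes():
    done = threading.Event()
    # Stand-in for asynchronous work whose duration varies between runs.
    worker = threading.Timer(0.05, done.set)
    worker.start()
    # Flaky version: time.sleep(0.05) followed by assert done.is_set(),
    # which fails whenever the worker is scheduled slightly late.
    wait_until(done.is_set)
    assert done.is_set()
    worker.join()
```

The test now passes as soon as the condition holds and fails only when the generous deadline is exceeded, so the outcome no longer depends on guessing the right sleep duration.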

The Culture Change

A good rule: if a test fails more than twice in a week without a code change, it gets quarantined into a separate "unstable" suite that runs nightly but does not block merges. The test owner gets a ticket with a short SLA. If it is not fixed in time, it gets deleted.

Zero tolerance for flaky tests means zero tolerance for ignoring failures. That is a culture shift, not a code fix.