
Kubernetes Resource Limits: The Silent Production Killer

Kubernetes · OOMKill · Resource Management · Production

A surprising number of production incidents in Kubernetes environments trace back to resource limits. Not missing limits: wrong limits. Here is what the failure patterns look like and how to set limits correctly.

The OOMKill Pattern

A Java service whose container memory limit equals its configured JVM heap (-Xmx) will eventually be OOMKilled. The heap is only part of the picture: the JVM also needs memory for metaspace, thread stacks, native allocations, and garbage collection overhead, so the total JVM footprint is typically 1.5-2x the heap size. Set the container memory limit to at least 2x the configured heap.
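In manifest form, the arithmetic looks like this. A minimal sketch of a pod spec fragment; the service name, image, and concrete sizes are illustrative assumptions, not values from this article:

```yaml
# Illustrative fragment: a 2Gi heap needs roughly a 4Gi container.
containers:
  - name: app
    image: payments-api:latest        # hypothetical service
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-Xms2g -Xmx2g"        # heap only; metaspace, threads, and GC overhead are extra
    resources:
      requests:
        memory: "4Gi"                 # ~2x heap covers non-heap JVM memory
      limits:
        memory: "4Gi"                 # limit == request keeps memory QoS predictable
```

On JDK 10+ an alternative is to drop -Xmx and use -XX:MaxRAMPercentage (e.g. 50.0) so the heap is derived from the container limit instead of being configured twice.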

The CPU Throttle Pattern

Setting CPU limits that seem generous relative to normal usage can still cause problems. During garbage collection the JVM needs burst CPU capacity, but a CPU limit is enforced by the kernel's CFS bandwidth controller: once the quota for a period (100ms by default) is spent, the container is throttled for the remainder of that period. GC pauses that should take milliseconds can take hundreds of milliseconds, and response times spike periodically in ways that are hard to diagnose.

The fix: remove CPU limits entirely. Use CPU requests for scheduling, but let containers burst. Kubernetes nodes generally have enough capacity for occasional spikes. CPU throttling causes more production incidents than CPU overcommitment.
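The fix above reduces to one manifest rule: set a CPU request, omit the CPU limit, keep memory bounded. A sketch with illustrative values:

```yaml
resources:
  requests:
    cpu: "500m"        # used for scheduling and fair-share weight
    memory: "4Gi"
  limits:
    memory: "4Gi"      # keep the memory limit; only the CPU limit is omitted
    # no cpu limit: the container can burst during GC instead of being throttled
```

Note the asymmetry is deliberate: memory is not compressible, so an unbounded container can destabilize the node, while CPU contention degrades gracefully through fair sharing.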

The Right Defaults

After extensive testing, a solid default resource template for Java services follows from the patterns above: memory request and limit both set to at least 2x the configured heap, a CPU request for scheduling, and no CPU limit.
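Assembled from those rules, such a template might look like the following. The concrete numbers are illustrative assumptions, not tested values from this article:

```yaml
# Default resource template for a Java service (illustrative values).
resources:
  requests:
    cpu: "1"           # scheduling hint; real usage may burst above this
    memory: "4Gi"      # at least 2x the configured JVM heap
  limits:
    memory: "4Gi"      # equal to the request
    # deliberately no cpu limit (see the CPU throttle pattern above)
```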

A Kubernetes admission webhook that warns (not blocks) when a deployment sets CPU limits is a good safety net. The warning should include documentation explaining why CPU limits are harmful for JVM workloads.
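Instead of writing a custom webhook server, the same warn-only check can be expressed declaratively with a CEL-based ValidatingAdmissionPolicy (GA in Kubernetes v1.30). A sketch, with the policy name and message wording as assumptions:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: warn-cpu-limits            # hypothetical name
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: >-
        object.spec.template.spec.containers.all(c,
          !has(c.resources.limits) || !('cpu' in c.resources.limits))
      message: >-
        CPU limit set; for JVM workloads prefer a CPU request with no limit,
        since throttling inflates GC pauses.
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: warn-cpu-limits-binding
spec:
  policyName: warn-cpu-limits
  validationActions: ["Warn"]      # surface as an API warning; never block the deploy
```

The "Warn" action returns the message as an API response warning (kubectl prints it) without rejecting the object, which matches the warn-not-block intent.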