P95 + Headroom: How to Right-Size Kubernetes Without Throttling Workloads

Right-sizing on average utilization is how teams accidentally cause throttling and OOMKills. This is the P95-plus-headroom methodology — how to set requests and limits from real usage, the difference between the two, and the kubectl patches to apply it safely.

KorPro Team
June 24, 2026

The fastest way to cause a production incident while trying to save money is to right-size Kubernetes workloads on average utilization. It looks responsible — usage is low, so requests come down, and the cost dashboard improves. Then the next traffic burst hits, the workload gets throttled or OOMKilled, and the on-call engineer spends the savings on a postmortem.

Right-sizing done well is one of the largest and safest sources of Kubernetes savings, because most clusters are dramatically over-provisioned: requests are set once, copied from a template, and never revisited. But the methodology matters. This post covers why average-based sizing is dangerous, how to size on P95 plus headroom instead, the real difference between requests and limits, and the kubectl patches to apply it without throttling your workloads.

Why Averages Lie

Consider a pod that averages 200m of CPU over a day. That number feels like a clean target for the request. But the average is a flat summary of a spiky reality: the workload might idle at 50m for most of the day and burst to 900m during request floods, batch processing, or garbage collection pauses.

If you set the CPU request to 200m, the scheduler reserves only that much. During the burst, the workload competes for CPU it was never guaranteed: on a busy node it gets only the share its request entitles it to, and if a CPU limit is set, the kernel's CFS quota throttles it at that limit, so latency climbs at exactly the worst moment. Memory is less forgiving still: there is no "throttling" for memory. A workload that averages 400 MB but peaks at 1.2 GB during a large request will be OOMKilled as soon as it crosses its memory limit. The container is terminated, restarted, and may crash-loop under sustained load.

Averages optimize for the common case and ignore the case that actually causes incidents. That is exactly backwards for capacity planning.
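To make the gap concrete, here is a minimal shell sketch. The data is invented for illustration: 14 idle samples at 50m and 3 bursts at 900m, chosen so the mean lands on the 200m figure used above. The function names `summarize` and `samples` are hypothetical.

```bash
# Summarize a stream of CPU samples (millicores, one per line).
# Prints the mean and the maximum.
summarize() {
  awk '
    NR == 1 || $1 + 0 > max { max = $1 + 0 }
    { sum += $1 }
    END {
      if (NR == 0) { exit 1 }
      printf "mean=%dm max=%dm\n", sum / NR, max
    }'
}

# Illustrative data only: 14 idle samples at 50m, 3 bursts at 900m.
samples() {
  i=0; while [ "$i" -lt 14 ]; do echo 50; i=$((i + 1)); done
  i=0; while [ "$i" -lt 3 ]; do echo 900; i=$((i + 1)); done
}

samples | summarize   # prints: mean=200m max=900m
```

The mean says 200m; the workload actually needs 900m for roughly one sample in six. A request sized on the mean covers none of those bursts.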

The P95 + Headroom Method

P95, the 95th percentile of observed usage, is a better basis. It is the level that usage stays at or below 95% of the time, so it captures sustained real demand while ignoring the rare, one-off spikes that would otherwise force you to over-provision permanently. You then add headroom on top: a buffer for normal variance, brief spikes above P95, and modest organic growth.

The method, per workload:

  1. Collect usage over a representative window — at least one full traffic cycle, ideally 1-2 weeks, so weekly peaks and batch jobs are included.
  2. Compute P95 of CPU and memory from that window.
  3. Add headroom. Start around 15-25% over P95 for memory and 20-30% for CPU. Give bursty, latency-sensitive services more; give steady batch jobs less.
  4. Set requests at P95 + headroom. This is the value the scheduler reserves.
  5. Set limits deliberately based on burst behavior (see below) — not by blindly multiplying the request.
  6. Re-measure after the change and tune. Right-sizing is a loop, not a one-time edit.

The reason P95 plus headroom is safe where averages are dangerous: you are reserving capacity for the load that is almost always present, plus a cushion, while declining to pay for the rare spike that P99/max would force you to provision for full-time.
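Steps 2 through 4 can be sketched in a few lines of shell. This is one possible approach, not the only one: it assumes usage samples arrive one per line in a single consistent unit, uses the nearest-rank definition of a percentile (monitoring tools may interpolate instead), and the function names `percentile` and `with_headroom` are hypothetical.

```bash
# Nearest-rank percentile of numeric samples read from stdin (one per line).
# Usage: percentile 95 < samples.txt
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) { exit 1 }
      rank = int((NR * p + 99) / 100)   # ceil(NR * p / 100) for integer p
      if (rank < 1) { rank = 1 }
      print v[rank]
    }'
}

# Add a percentage of headroom and round up to a whole unit.
# Usage: with_headroom VALUE PERCENT
with_headroom() {
  awk -v v="$1" -v h="$2" 'BEGIN {
    x = v * (100 + h) / 100
    r = int(x); if (r < x) { r++ }
    print r
  }'
}

# Example: a P95 of 240m CPU and 560 MB memory, each with 25% headroom.
with_headroom 240 25   # prints 300
with_headroom 560 25   # prints 700
```

Rounding up rather than to the nearest unit is a deliberate choice here: when in doubt, the request errs slightly on the generous side.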

Requests vs. Limits — Get This Right

This is the distinction that determines whether right-sizing saves money or causes outages.

Requests are the guaranteed floor. The scheduler uses requests to decide which node a pod lands on and reserves that capacity. Set requests too high and you waste money — the node is "full" of reservations that nothing uses, so the autoscaler adds nodes you didn't need. Set them too low and pods get scheduled onto nodes that can't actually serve their real demand.

Limits are the hard ceiling enforced at runtime, and the two resources behave very differently at the ceiling:

  • CPU is compressible. Exceed the CPU limit and the workload is throttled — slowed down, not killed. A too-tight CPU limit silently adds latency.
  • Memory is incompressible. Exceed the memory limit and the container is OOMKilled — terminated immediately. A too-tight memory limit causes crashes and restart loops.

Practical guidance that follows from this:

  • Always set memory limits, so that a leaking container cannot exhaust node memory and take its neighbors down with it. Set them with enough headroom above P95 that normal peaks don't trip them, because the failure mode at the limit is a hard kill.
  • Be cautious with CPU limits. Set CPU requests accurately for fair scheduling, and either set generous CPU limits or omit them on trusted workloads. An aggressive CPU limit is a common, hard-to-diagnose source of latency.
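Because CPU throttling is silent, it helps to measure it directly. The sketch below assumes the node uses cgroup v2, where the container's `cpu.stat` file reports `nr_periods` and `nr_throttled`; on cgroup v1 the file lives at a different path. The function name `throttled_pct` and the pod name in the usage comment are hypothetical.

```bash
# Read cgroup v2 cpu.stat content on stdin and print the percentage of
# CFS enforcement periods in which the container was throttled.
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods + 0 == 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * throttled / periods
    }'
}

# Usage against a running pod (assumes the image ships `cat`):
#   kubectl exec -n production api-abc123 -- cat /sys/fs/cgroup/cpu.stat | throttled_pct
```

A persistently high percentage on a latency-sensitive service is a strong hint that its CPU limit is too tight, even when average CPU usage looks low.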

Applying It with kubectl

Once you have P95-plus-headroom targets, apply them with a patch. Suppose a deployment currently requests 500m CPU / 1Gi memory, but two weeks of data show P95 at 240m CPU and 560 MB memory. With 25% headroom that gives 300m CPU and 700 MB memory, written here as 700Mi (slightly more, since 1Mi is about 1.05 MB):

```bash
kubectl patch deployment api -n production --type='strategic' -p '{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "api",
          "resources": {
            "requests": { "cpu": "300m", "memory": "700Mi" },
            "limits": { "memory": "1Gi" }
          }
        }]
      }
    }
  }
}'
```

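Note the deliberate choices: the memory limit sits comfortably above the memory request to absorb spikes without OOMKilling, and the patch sets no CPU limit, letting the workload burst above its request when the node has spare capacity rather than throttling it. A strategic merge patch only changes the fields it names, so if the deployment already carries a CPU limit, you would need to set "cpu": null under limits to remove it.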
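If you are applying the same shape of patch to many workloads, generating the JSON avoids hand-editing it each time. The following is a sketch under the same assumptions as the patch above (one container per call, memory limit only); `build_patch` is a hypothetical helper, not part of kubectl.

```bash
# Print a strategic merge patch that sets requests and a memory limit
# for one container.
# Usage: build_patch CONTAINER CPU_REQUEST MEM_REQUEST MEM_LIMIT
build_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"cpu":"%s","memory":"%s"},"limits":{"memory":"%s"}}}]}}}}\n' \
    "$1" "$2" "$3" "$4"
}

# Usage: validate against the API server first, then apply for real.
#   kubectl patch deployment api -n production --type=strategic \
#     --dry-run=server -p "$(build_patch api 300m 700Mi 1Gi)"
```

Running with `--dry-run=server` first lets the API server validate the patch without changing the deployment.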

Roll out one workload at a time and watch it for a full traffic cycle. The signals that tell you whether the sizing held:

```bash
# Look for OOMKills and restarts after the change
kubectl get pods -n production -o wide
kubectl get events -n production --field-selector reason=OOMKilling

# Watch live usage against the new requests/limits
kubectl top pods -n production
```

If you see OOMKills, your memory limit (and likely request) is too tight — raise the headroom. If latency degrades without OOMKills, you are likely CPU-throttled — loosen or remove the CPU limit. If usage sits comfortably below the new requests with no restarts, you can tighten further on the next pass.
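Events age out, so it can also help to read the last termination reason straight from pod status. This is a sketch: the filter `oomkilled_pods` is a hypothetical helper, and it assumes tab-separated input with the pod name in the first column and the termination reasons in the second.

```bash
# Keep only the pods whose last container termination was an OOM kill.
# Input: "pod-name<TAB>reasons" lines on stdin.
oomkilled_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | oomkilled_pods
```

Any pod name this prints is a candidate for more memory headroom on the next pass.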

For workloads where you'd rather not hand-tune, the Vertical Pod Autoscaler can recommend (or apply) requests based on observed usage — but treat its recommendations as a starting point and still apply the headroom and requests-vs-limits judgment above.
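One way to get recommendations without letting the autoscaler change anything is to run it in recommendation-only mode. The sketch below assumes the Vertical Pod Autoscaler components are installed in the cluster; the helper `vpa_manifest` and the `-vpa` naming suffix are hypothetical.

```bash
# Print a recommendation-only VPA manifest for a Deployment.
# updateMode "Off" means the VPA computes recommendations but never
# evicts or resizes pods.
# Usage: vpa_manifest DEPLOYMENT NAMESPACE
vpa_manifest() {
  cat <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${1}-vpa
  namespace: ${2}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${1}
  updatePolicy:
    updateMode: "Off"
EOF
}

# Usage (needs cluster access):
#   vpa_manifest api production | kubectl apply -f -
#   kubectl describe vpa api-vpa -n production
```

After it has observed the workload for a while, the recommendation appears in the object's status, where you can compare it against your own P95-plus-headroom numbers.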

Right-Sizing Is the Start, Not the Whole Job

Right-sizing pods recovers the over-provisioned capacity inside the cluster, and it pairs naturally with finding workloads that aren't doing any work at all — idle deployments, orphaned resources that keep billing, and oversized PVCs. It also pairs with the other half of your bill: the managed databases, caches, and log pipelines around the cluster are over-provisioned for the same reasons, and the P95-plus-headroom discipline applies to them too.

Set requests from real P95 usage, add deliberate headroom, treat memory limits and CPU limits as the different tools they are, and roll changes out one workload at a time. Done that way, right-sizing is where a large share of Kubernetes savings comes from — without a single throttled request or OOMKill.


Right-Size Your Cluster from Real Usage

Want to see your over-provisioned workloads ranked by potential savings, computed from real P95 usage with safe headroom built in? Create a free KorPro account to get right-sizing recommendations across every namespace in minutes. Prefer a guided look? Contact our team for a walkthrough.
