Scaling

HorizontalPodAutoscaler (HPA)

A controller that automatically scales the replica count of a Deployment or StatefulSet based on observed metrics.

Also known as: HPA

What is HorizontalPodAutoscaler?

The HorizontalPodAutoscaler (HPA) is a Kubernetes controller that watches a target workload (a Deployment, StatefulSet, or any resource with a /scale subresource) and adjusts its replica count to keep one or more metrics at a target value. The classic metric is average CPU utilization across all Pods, measured as a percentage of each Pod's CPU request (e.g., 'keep average CPU at 70%'): when traffic increases and CPU rises above the target, HPA adds replicas, up to maxReplicas; when traffic drops, HPA removes replicas, but never goes below minReplicas.
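
As a rough illustration of how the replica count follows the metric, the Kubernetes documentation describes the core calculation as a ratio of the current metric value to the target value. The numbers in the worked line are hypothetical (4 replicas averaging 90% CPU against a 70% target), and the sketch ignores details such as the default tolerance band and Pods that are not yet ready:

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
% Hypothetical example: 4 replicas at 90% average CPU, target 70%
\[
\left\lceil 4 \times \frac{90}{70} \right\rceil = \lceil 5.14 \rceil = 6
\]
```

The result is then clamped to the range between minReplicas and maxReplicas.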

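HPA v2 (the autoscaling/v2 API, stable since Kubernetes 1.23) supports multiple metrics simultaneously: CPU, memory, custom metrics (for example from Prometheus, exposed through an adapter that serves the custom.metrics.k8s.io API), and external metrics (for example a cloud provider metric such as SQS queue depth, served through the external.metrics.k8s.io API). When several metrics are configured, HPA computes a desired replica count for each and uses the largest. Scale-down is damped by default: HPA applies a configurable stabilization window (300 seconds by default) and acts on the highest recommendation seen during that window, which prevents flapping. Scale-up is faster: its default stabilization window is 0 seconds, and the default policy allows adding up to 4 Pods or doubling the replica count every 15 seconds, whichever is greater.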
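
A multi-metric HPA with explicit scaling behavior might look like the sketch below. The workload name and both metric names (http_requests_per_second, sqs_queue_depth) are hypothetical, and the custom and external metrics only resolve if a metrics adapter is installed and serving them. The behavior values are intended to mirror the documented defaults, written out so they are easy to tune:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker                    # hypothetical workload
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource                  # built-in resource metric
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods                      # custom metric, served via custom.metrics.k8s.io
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"
  - type: External                  # external metric, served via external.metrics.k8s.io
    external:
      metric:
        name: sqs_queue_depth       # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0 # react to load increases immediately
      policies:
      - type: Percent
        value: 100                  # at most double the replicas per period
        periodSeconds: 15
      - type: Pods
        value: 4                    # or add at most 4 Pods per period
        periodSeconds: 15
      selectPolicy: Max             # use whichever policy allows more
    scaleDown:
      stabilizationWindowSeconds: 300   # wait out short dips before removing Pods
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
```

Lengthening the scaleDown window or lowering the scaleUp limits trades responsiveness for stability; suitable values depend on how spiky the workload is.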

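HPA and VPA should not both act on CPU or memory for the same workload: HPA changes the replica count in response to utilization, while VPA changes the resource requests that utilization is measured against, so the two will fight each other. The recommended pattern is to use HPA to manage replica count based on CPU/memory utilization, and to run VPA in recommendation-only mode (updateMode: "Off") so that it suggests right-sized resource requests without applying them. Because utilization targets are percentages of the requests, well-sized requests make HPA's scaling decisions more accurate.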
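
A recommendation-only VPA for the same workload could be sketched as follows. This assumes the Vertical Pod Autoscaler add-on and its CRDs are installed in the cluster (it is not part of core Kubernetes); the object name web-api-vpa is hypothetical, and the target matches the web-api Deployment used in this entry's example:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa          # hypothetical name
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"        # compute recommendations only; never evict or resize Pods
```

With updateMode set to "Off", the recommendations appear in the object's status (for example via kubectl describe vpa web-api-vpa) and can be applied manually to the Deployment's resource requests.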

Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300

Cost & Waste Implications

Without HPA, Deployments are typically sized statically for peak load and run at that replica count 24/7. Consider a service that needs 10 replicas during 8 peak hours each day and only 2 replicas during the remaining 16 off-peak hours: scaling dynamically saves about 53% of compute cost compared with running 10 replicas around the clock, assuming cost is proportional to replica-hours. HPA is one of the most impactful and low-risk cost optimization techniques available in Kubernetes.
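
The 53% figure can be checked with a short calculation in replica-hours per day, under the simplifying assumption that cost scales linearly with replica-hours (which in practice also requires the underlying nodes to scale down):

```latex
\begin{align*}
\text{static}  &= 10 \times 24 = 240 \ \text{replica-hours/day} \\
\text{dynamic} &= 10 \times 8 + 2 \times 16 = 112 \ \text{replica-hours/day} \\
\text{savings} &= 1 - \frac{112}{240} \approx 0.533 \approx 53\%
\end{align*}
```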


How KorPro Helps

KorPro identifies Deployments without HPAs that have variable CPU/memory utilization patterns, quantifying the estimated savings from implementing autoscaling versus running at static peak capacity.
