Scaling

HorizontalPodAutoscaler (HPA)

A controller that automatically scales the replica count of a Deployment or StatefulSet based on observed metrics.

Also known as: HPA

What is HorizontalPodAutoscaler?

The HorizontalPodAutoscaler (HPA) is a Kubernetes controller that watches a target workload (Deployment, StatefulSet, or any resource with a /scale subresource) and adjusts its replica count to keep one or more metrics at a target value. The classic metric is average CPU utilization across all Pods, measured as a percentage of each Pod's CPU request (e.g., 'keep average CPU at 70%'): when traffic increases and CPU rises above the target, HPA adds replicas, up to maxReplicas; when traffic drops, HPA removes replicas, but never goes below minReplicas.
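The core calculation can be sketched in a few lines of Python. This is a simplified model of the ratio formula described in the Kubernetes documentation (desired = ceil(current replicas × current metric / target metric)), including the default 10% tolerance; the function name and parameters are illustrative, not part of any Kubernetes API, and the real controller also accounts for unready Pods and missing metrics.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.10):
    """Replica count a simplified HPA would aim for (illustrative helper)."""
    ratio = current_value / target_value
    # Within the tolerance band around the target, the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to [minReplicas, maxReplicas].
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 90% CPU against a 70% target: scale up.
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))  # 6
# 72% is within 10% of the target: no change.
print(desired_replicas(4, 72, 70, min_replicas=2, max_replicas=20))  # 4
# Nearly idle: the formula suggests 1, but minReplicas holds it at 2.
print(desired_replicas(4, 10, 70, min_replicas=2, max_replicas=20))  # 2
```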

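HPA v2 (the autoscaling/v2 API, stable since Kubernetes 1.23) supports multiple metrics simultaneously: CPU, memory, custom metrics (for example from Prometheus, served through the custom.metrics.k8s.io API by a metrics adapter), and external metrics (from sources outside the cluster, such as SQS queue depth, served through the external.metrics.k8s.io API). When several metrics are configured, HPA computes a replica count for each and uses the largest. Scale-down is dampened by default: HPA applies a configurable stabilization window (300 seconds by default) and acts on the highest recommendation seen during that window, which prevents flapping. Scale-up is faster: its default stabilization window is 0 seconds, and the default policy allows adding up to 4 Pods or 100% of the current replicas, whichever is greater, every 15 seconds.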
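To make the scale-down stabilization window concrete, here is a minimal Python sketch. It assumes the documented behavior that, for scale-down, the controller considers the recommendations from the past window and uses the highest one; the class name ScaleDownStabilizer and its methods are hypothetical and only model the idea, not the controller's actual code.

```python
from collections import deque

class ScaleDownStabilizer:
    """Toy model: return the highest recommendation seen within the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # entries of (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        self.history.append((now, recommendation))
        # Drop recommendations that are older than the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        # Scaling down is limited to the highest recent recommendation.
        return max(replicas for _, replicas in self.history)

stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 10))    # 10: load is high
print(stabilizer.stabilize(60, 4))    # 10: the dip is ignored for now
print(stabilizer.stabilize(301, 4))   # 4: the high reading has aged out
```

A brief dip in load therefore does not remove replicas immediately; the lower count only takes effect once every higher recommendation has left the window.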

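HPA and VPA should not both act on CPU or memory for the same workload, because they will fight each other: HPA computes utilization as a percentage of the Pods' resource requests, so when VPA changes those requests, the utilization HPA sees shifts and it rescales in response. The recommended pattern is to let HPA control the replica count based on CPU/memory utilization and to run VPA in recommendation-only mode (updateMode: "Off"), applying its suggested requests deliberately, since right-sized requests are what make HPA's utilization targets meaningful.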
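The interaction is easier to see with numbers. The sketch below assumes utilization is computed as usage divided by request and reuses the basic ratio formula without tolerance or min/max clamping; the usage and request values are made up for illustration.

```python
import math

def cpu_utilization_percent(usage_millicores, request_millicores):
    # Utilization is relative to the Pod's CPU request, not to node capacity.
    return 100.0 * usage_millicores / request_millicores

def desired_replicas(current_replicas, utilization, target_utilization):
    # Basic ratio formula, without tolerance or min/max clamping.
    return math.ceil(current_replicas * utilization / target_utilization)

usage = 350  # assumed average per-Pod CPU usage, in millicores
for request in (500, 250):  # request before and after a hypothetical VPA change
    utilization = cpu_utilization_percent(usage, request)
    replicas = desired_replicas(4, utilization, target_utilization=70)
    print(f"request={request}m utilization={utilization:.0f}% desired={replicas}")
# request=500m utilization=70% desired=4
# request=250m utilization=140% desired=8
```

Actual CPU usage is identical in both cases, yet halving the request doubles the replica count HPA asks for. That feedback loop is why the two autoscalers should not both react to the same resource.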

Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
  namespace: production
spec:
  # Workload whose replica count HPA controls
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2    # floor, even when idle
  maxReplicas: 20   # ceiling, even under heavy load
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of each Pod's CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the default, set explicitly here

Cost & Waste Implications

Without HPA, Deployments are typically sized statically for peak load and run at that replica count 24/7. Consider a service that needs 10 replicas for 8 peak hours a day and 2 replicas for the remaining 16 hours: scaling dynamically uses 112 replica-hours per day instead of 240, saving about 53% of compute cost, provided the freed capacity is actually released (for example, by the Cluster Autoscaler removing underutilized nodes). HPA is one of the most impactful and low-risk cost optimization techniques available in Kubernetes.
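The 53% figure follows from simple replica-hour arithmetic, reproduced below in Python. It assumes cost is proportional to replica-hours and that every replica costs the same; the helper name replica_hours is illustrative.

```python
def replica_hours(schedule):
    """Total replica-hours for a list of (replicas, hours) segments."""
    return sum(replicas * hours for replicas, hours in schedule)

static = replica_hours([(10, 24)])            # sized for peak all day: 240
dynamic = replica_hours([(10, 8), (2, 16)])   # 8 h peak + 16 h off-peak: 112

savings = 1 - dynamic / static
print(f"static={static} dynamic={dynamic} savings={savings:.1%}")
# static=240 dynamic=112 savings=53.3%
```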


How KorPro Helps

KorPro identifies Deployments without HPAs that have variable CPU/memory utilization patterns, quantifying the estimated savings from implementing autoscaling versus running at static peak capacity.


Related Terms

Deployment

Workloads

A controller that manages a ReplicaSet to keep a specified number of identical Pod replicas running and handles rolling updates.


VerticalPodAutoscaler (VPA)

Scaling

A controller that recommends or automatically adjusts CPU and memory resource requests for Pods based on observed usage.


Cluster Autoscaler

Scaling

A component that automatically adds nodes when Pods are unschedulable and removes nodes when they are underutilized.


Resource Requests and Limits

Configuration

Per-container declarations of guaranteed CPU/memory (requests) and hard maximums (limits) that drive scheduling and enforcement.

