Kubernetes Pod Scaling: Horizontal Pod Autoscaler

The Horizontal Pod Autoscaler (HPA) watches a deployment and adjusts its replica count based on observed CPU utilization. Utilization is always a ratio: actual CPU usage divided by requests.cpu, averaged across the deployment's pods. Without requests set, HPA has no baseline to compare against and reports <unknown>; it cannot calculate or act on anything.

Task 1 - HPA Without CPU Requests

Deploy a workload without CPU requests and attach an HPA to it. Observe what happens.

Steps:

Create a Deployment named web using the image registry.k8s.io/hpa-example, with 1 replica
Expose the deployment as a ClusterIP Service on port 80
Create an HPA for web: min 1 replica, max 5, target CPU utilization 50%

kubectl get hpa web

The TARGETS column shows <unknown>/50%. The HPA exists and knows its target, but it cannot calculate current utilization because there is no requests.cpu to divide by.

Hint: Create HPA imperatively

kubectl expose deployment web --port=80
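kubectl autoscale deployment web --min=1 --max=5 --cpu=50%

If your kubectl version does not recognize --cpu, the older spelling of the same option is --cpu-percent=50.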
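The hint above covers the Service and the HPA but not the Deployment itself. As a minimal sketch, the snippet below writes a manifest for it; the file name web-deployment.yaml and the app: web label are assumptions made for illustration, not part of the challenge.

```shell
# Write a Deployment manifest for the "web" workload (file name is hypothetical).
cat > web-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: hpa-example
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        # No resources.requests.cpu here on purpose: Task 1 needs the
        # deployment to run without CPU requests.
EOF
```

Apply it with kubectl apply -f web-deployment.yaml. For this exercise the imperative form kubectl create deployment web --image=registry.k8s.io/hpa-example should work just as well.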

Hint: Add behavior section via dry-run

Generate the HPA manifest, add the behavior section, then apply:

kubectl autoscale deployment web --min=1 --max=5 --cpu=50% --dry-run=client -o yaml > hpa.yaml

Add under spec:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30

kubectl apply -f hpa.yaml
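For reference, the edited manifest might end up looking like the sketch below. It assumes the generated manifest uses apiVersion autoscaling/v2, which is where the behavior field is defined; if your kubectl generates autoscaling/v1 instead, the layout will differ. The file name hpa-behavior-example.yaml is hypothetical and chosen so that it does not overwrite hpa.yaml.

```shell
# Write an example of the finished HPA manifest (file name is hypothetical).
cat > hpa-behavior-example.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      # Wait 30 seconds instead of the default before scaling down.
      stabilizationWindowSeconds: 30
EOF
```

Note that behavior sits directly under spec, at the same indentation as minReplicas and metrics.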

Task 2 - Add CPU Requests, HPA Activates

Set CPU requests on the deployment. HPA uses them as the denominator: utilization% = actual_cpu / requests.cpu × 100.
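To make the ratio concrete, here is a small shell sketch of the arithmetic. The function name cpu_utilization_pct is illustrative, both arguments are in millicores, and a missing request is modeled as 0, which is an assumption made only for this example.

```shell
# Utilization as a percentage: usage divided by requests.cpu, times 100.
# Integer arithmetic, so the result is truncated to a whole percent.
cpu_utilization_pct() (
  usage_m=$1
  request_m=$2
  if [ "$request_m" -eq 0 ]; then
    # No request means no denominator, which is the <unknown> case.
    echo "<unknown>"
  else
    echo $(( usage_m * 100 / request_m ))
  fi
)

cpu_utilization_pct 20 200    # idle pod: prints 10
cpu_utilization_pct 100 200   # exactly at the 50% target: prints 50
cpu_utilization_pct 300 200   # usage above the request: prints 150
cpu_utilization_pct 100 0     # no request set: prints <unknown>
```

Utilization can exceed 100% because requests.cpu is a scheduling reservation, not a cap on usage.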

Steps:

Set CPU requests of 200m on the web deployment

kubectl get hpa web

kubectl describe hpa web

Hint: Set resource requests imperatively

kubectl set resources deployment web --requests=cpu=200m

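Changing the requests triggers a rolling update of the deployment. Once the new pod is running and its metrics have been collected, which can take up to a minute or so, the TARGETS column changes from <unknown> to a real percentage. At idle, the app uses very little CPU, so the percentage will be low. The HPA is now active and will scale up if average utilization rises above 50%.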

Task 3 - Generate Load and Watch Scale-Up

Send traffic to the app. The registry.k8s.io/hpa-example image runs a PHP script that does CPU work on every request. Under load, utilization climbs above 50% and HPA adds replicas.

Steps:

Create a load generator pod that floods the service:

kubectl run load-gen \
  --image=busybox \
  --restart=Never \
  -- sh -c "while true; do wget -q -O- http://web; done"

Watch HPA react:

kubectl get hpa web -w

Once average CPU usage per pod rises above 100m (50% of the 200m requested), utilization exceeds the target and HPA scales the deployment up. With sustained load the replica count should reach 5, the max defined in the HPA. At that point HPA stops adding replicas even if utilization stays high.
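How many replicas HPA asks for follows the documented rule desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), bounded by the min and max. The sketch below works through that arithmetic in shell; the function name desired_replicas is illustrative, and the sketch deliberately ignores the controller's tolerance band and its handling of pods that are not yet ready.

```shell
# Arguments: current replicas, current utilization %, target utilization %,
# min replicas, max replicas.
desired_replicas() (
  current=$1
  current_pct=$2
  target_pct=$3
  min=$4
  max=$5
  # Integer ceiling of current * current_pct / target_pct.
  desired=$(( (current * current_pct + target_pct - 1) / target_pct ))
  # Clamp to the HPA's min/max bounds.
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

desired_replicas 1 120 50 1 5   # ceil(2.4) = 3: prints 3
desired_replicas 3 90 50 1 5    # ceil(5.4) = 6, capped at max: prints 5
desired_replicas 5 10 50 1 5    # ceil(1.0) = 1: prints 1
```

The second call shows why the replica count stops at 5 even though the formula asks for more.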

kubectl get pods

Task 4 - Stop the Load

Delete the load generator. CPU utilization drops and HPA will scale back down to the minimum replica count.

kubectl delete pod load-gen

kubectl describe hpa web

Scale-down does not happen immediately. HPA applies a stabilization window before reducing replicas (5 minutes by default, or 30 seconds if you added the behavior section from the Task 1 hint); this prevents flapping when load briefly dips. The Events section in describe shows the full scaling history.
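The effect of the stabilization window can be sketched as follows: for scale-down, the controller considers the replica counts it computed during the window and keeps the highest one, so a single low reading is not enough to remove pods. The function name stabilized_replicas and the sample values are illustrative only.

```shell
# Arguments: the replica recommendations computed within the window.
# Prints the highest one, which is what a scale-down is held to.
stabilized_replicas() (
  highest=0
  for rec in "$@"; do
    if [ "$rec" -gt "$highest" ]; then highest=$rec; fi
  done
  echo "$highest"
)

stabilized_replicas 5 3 1 1   # a 5 is still inside the window: prints 5
stabilized_replicas 1 1 1 1   # only low recommendations remain: prints 1
```

This is why the replica count stays at its peak for a while after the load generator is deleted, then drops once the older, higher recommendations have aged out of the window.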

kubectl get hpa web -w

This challenge is part of the Kubernetes Pod Scheduling skill path.