HPA scale down not happening properly


When the load decreases, the HPA intentionally waits a certain amount of time before scaling the app down. This is known as the cooldown delay, and it helps prevent the app from being scaled up and down too frequently. As a result, for a certain time the app keeps running at the previous high replica count even though the metric value is well below the target. This may look like the HPA isn't responding to the decreased load, but it eventually will.

However, the default duration of the cooldown delay is 5 minutes. So if the app still hasn't been scaled down after 30-40 minutes, something else is going on, unless the cooldown delay has been set to a different value with the --horizontal-pod-autoscaler-downscale-stabilization flag of the controller manager.
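The effect of the cooldown delay can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: it assumes the controller remembers the replica counts it recommended during the window and, when scaling down, applies the highest of them. The function name and the sample history are made up for illustration.

```python
def stabilized_replicas(recommendations, now, window_seconds=300):
    """Return the replica count to apply at time `now`.

    `recommendations` is a list of (timestamp_seconds, desired_replicas)
    pairs. Only entries from the last `window_seconds` are considered,
    and the highest one wins, so a scale-down is delayed until every
    higher recommendation has aged out of the window.
    """
    recent = [r for t, r in recommendations if now - window_seconds <= t <= now]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


# Hypothetical history: the load drops at t=120s, so the raw
# recommendation falls from 5 to 2 replicas.
history = [(0, 5), (60, 5), (120, 2), (180, 2), (240, 2), (300, 2), (360, 2), (420, 2)]

print(stabilized_replicas(history[:6], now=300))  # 5: the t=0 and t=60 entries are still in the window
print(stabilized_replicas(history, now=420))      # 2: every recommendation of 5 has aged out
```

With a 5-minute window the app stays at 5 replicas for about 5 minutes after the load drops, which is why a delay of 30-40 minutes can't be explained by the default cooldown alone.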

In the output that you posted, the metric value is 49% with a target of 60%, and the current replica count is 3. That actually looks reasonable: a value somewhat below the target doesn't necessarily mean that a replica can be removed.
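To see why 3 replicas is a plausible result for those numbers, here is the calculation as a small Python sketch. It uses the ratio form of the formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), and assumes the default tolerance of 10%. The function name is made up for illustration.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count the HPA would aim for, given an averaged metric value."""
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Numbers from the posted output: 49% utilisation, 60% target, 3 replicas.
print(desired_replicas(3, 49, 60))  # ceil(3 * 49 / 60) = ceil(2.45) = 3
```

So at 49% the HPA still wants 3 replicas. The metric would have to fall to roughly 40% before 2 replicas are enough.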

An issue might be that you're using the memory utilisation as a metric, which is not a good autoscaling metric.

An autoscaling metric should linearly respond to the current load across the replicas of the app. If the number of replicas is doubled, the metric value should halve, and if the number of replicas is halved, the metric value should double. The memory utilisation in most cases doesn't show this behaviour. For example, if each replica uses a fixed amount of memory, then the average memory utilisation across the replicas stays roughly the same regardless of how many replicas were added or removed. The CPU utilisation generally works much better in this regard.
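The difference can be illustrated with a toy model in Python. All numbers are invented: it assumes the total CPU demand is spread evenly over the replicas, while every replica holds the same fixed amount of memory regardless of load.

```python
def avg_cpu_utilisation(total_cpu_demand_millicores, replicas, cpu_request_millicores=500):
    """CPU work is shared, so per-replica utilisation shrinks as replicas are added."""
    per_replica = total_cpu_demand_millicores / replicas
    return 100 * per_replica / cpu_request_millicores


def avg_memory_utilisation(replicas, fixed_memory_mib=300, memory_request_mib=512):
    """Each replica holds the same fixed memory, so the average never moves."""
    total = fixed_memory_mib * replicas
    return 100 * (total / replicas) / memory_request_mib


for replicas in (2, 4, 8):
    cpu = avg_cpu_utilisation(1200, replicas)
    mem = avg_memory_utilisation(replicas)
    print(f"{replicas} replicas: CPU {cpu:.0f}%, memory {mem:.0f}%")
```

In this model the CPU figure goes 120%, 60%, 30% as the replica count doubles, while the memory figure stays at about 59% throughout, so the HPA would get no feedback from adding or removing replicas.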


In this case the Horizontal Pod Autoscaler is working as designed.

The autoscaler can be configured to use one or more metrics.

  1. Autoscaling based on a single metric - sums up the metric values of all the pods, divides that by the target value set on the HorizontalPodAutoscaler resource, and then rounds the result up to the next-larger integer.

desired_replicas = ceil(sum(utilization) / desired_utilization)

Example: the autoscaler is configured to scale on CPU. If the target is set to 30% and the total CPU usage across the pods is 97%: 97% / 30% = 3.23, and the HPA rounds it up to 4 (the next-larger integer).

  2. Autoscaling based on multiple pod metrics - calculates the replica count for each metric individually and then takes the highest value.

Example: if three pods are required to achieve the target CPU usage, and two pods are required to achieve the target memory usage, the autoscaler will scale to three pods, the highest number needed to meet a target.

  3. Autoscaling on custom metrics - allows you to scale up/down based on non-resource metric types, for example scaling your frontend application based on queries per second.
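The first two cases can be sketched in Python as follows. This is only an illustration of the arithmetic described above, not the autoscaler's implementation; the function names and the per-pod values in the second example are hypothetical.

```python
import math


def replicas_for_metric(pod_values, target_value):
    """Sum the per-pod metric values, divide by the target, round up."""
    return math.ceil(sum(pod_values) / target_value)


def replicas_for_metrics(metrics):
    """With several metrics, compute each count separately and keep the highest.

    `metrics` maps a metric name to a (pod_values, target_value) pair.
    """
    return max(replicas_for_metric(values, target) for values, target in metrics.values())


# Single metric: 97% total CPU usage against a 30% target.
print(replicas_for_metric([97], 30))  # 4

# Two metrics with made-up per-pod values: CPU needs 3 pods, memory needs 2.
print(replicas_for_metrics({
    "cpu": ([50, 45, 40], 50),     # 135 / 50 = 2.7 -> 3
    "memory": ([60, 50, 40], 80),  # 150 / 80 = 1.875 -> 2
}))  # 3
```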

I hope it helps.