How to scale Azure's Kubernetes Service (AKS) based on GPU metrics?


I managed to do this recently (just this week). I'll outline my solution and all the gotchas, in case that helps.

Starting with an AKS cluster, I installed the following components in order to harvest the GPU metrics:

  1. nvidia-device-plugin - to make GPU metrics collectable
  2. dcgm-exporter - a daemonset to reveal GPU metrics on each node
  3. kube-prometheus-stack - to harvest the GPU metrics and store them
  4. prometheus-adapter - to expose the harvested, stored metrics through the Kubernetes custom metrics API, which the HPA queries

The AKS cluster comes with a metrics server built in, so you don't need to worry about that. It is also possible to provision the cluster with the nvidia-device-plugin already applied, but that currently isn't possible via Terraform (see: Is it possible to use aks custom headers with the azurerm_kubernetes_cluster resource?), which is how I was deploying my cluster.

To install all this stuff I used a script much like the following:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
helm repo update

echo "Installing the NVIDIA device plugin..."
helm install nvdp/nvidia-device-plugin \
  --generate-name \
  --set migStrategy=mixed \
  --version=0.9.0

echo "Installing the Prometheus/Grafana stack..."
helm install prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace prometheus \
  --generate-name \
  --values ./kube-prometheus-stack.values

prometheus_service=$(kubectl get svc -nprometheus -lapp=kube-prometheus-stack-prometheus -ojsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace prometheus \
  --set rbac.create=true,prometheus.url=http://${prometheus_service}.prometheus.svc.cluster.local,prometheus.port=9090

helm install gpu-helm-charts/dcgm-exporter \
  --generate-name

Actually, I'm lying about the dcgm-exporter. I was experiencing a problem (my first "gotcha") where the dcgm-exporter was not responding to liveness probes in time and was consistently entering a CrashLoopBackOff state (https://github.com/NVIDIA/gpu-monitoring-tools/issues/120). To get around this, I created my own dcgm-exporter k8s config (by taking the details from https://github.com/NVIDIA/gpu-monitoring-tools and modifying them slightly) and applied it.

In doing this I hit my second "gotcha": the latest dcgm-exporter images have removed some GPU metrics, such as DCGM_FI_DEV_GPU_UTIL, largely because these metrics are resource intensive to collect (see https://github.com/NVIDIA/gpu-monitoring-tools/issues/143). If you want to re-enable them, make sure you run the dcgm-exporter with the arguments set as ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"] (there's a sketch of that approach after the Dockerfile below), OR you can create your own image and supply your own metrics list, which is what I did by using this Dockerfile:

FROM nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
RUN sed -i -e '/^# DCGM_FI_DEV_GPU_UTIL.*/s/^#\ //' /etc/dcgm-exporter/default-counters.csv
ENTRYPOINT ["/usr/local/dcgm/dcgm-exporter-entrypoint.sh"]
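
If you'd rather keep the stock image and use the arguments approach instead, the relevant part of the dcgm-exporter DaemonSet looks roughly like the sketch below. This is only an outline under my assumptions: the namespace, labels and port are placeholders, and in practice you'd also carry over the tolerations, nodeSelector and volume mounts from the upstream dcgm-exporter config.

# Sketch only: override the dcgm-exporter args so it loads the full DCP
# metrics list, which includes DCGM_FI_DEV_GPU_UTIL.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: my-namespace            # placeholder - use your own namespace
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
        args: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
        ports:
        - name: metrics
          containerPort: 9400        # dcgm-exporter's default metrics port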

Another thing you can see from the above script is that I also used my own values file for the kube-prometheus-stack helm chart. I followed the instructions from NVIDIA's site (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html), but found my third "gotcha" in the additionalScrapeConfigs.

What I learned was that, in the final deployment, the HPA has to be in the same namespace as the workload it's scaling (identified by scaleTargetRef), otherwise it can't find it to scale it, as you probably already know.

But just as importantly, the dcgm-exporter Service also has to be in the same namespace, otherwise the HPA can't find the metrics it needs to scale by. So I changed the additionalScrapeConfigs to target the relevant namespace. I'm sure there's a way to use the relabel_configs section of the additionalScrapeConfigs to keep dcgm-exporter in a different namespace and still have the HPA find the metrics, but I haven't had time to learn that voodoo yet.
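
For reference, the scrape config part of my kube-prometheus-stack.values ended up looking something like the sketch below. It follows the scrape config from NVIDIA's docs; the job name and namespace are placeholders for whatever your deployment actually uses.

# Sketch only: scrape dcgm-exporter endpoints in the same namespace as the
# HPA and the service being scaled.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: gpu-metrics
      scrape_interval: 1s
      metrics_path: /metrics
      scheme: http
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - my-namespace             # placeholder - namespace where dcgm-exporter runs
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node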

Once I had all of that in place, I could check that the DCGM metrics were being exposed through the custom metrics API:

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_GPU_UTIL

In the resulting list you really want to see a services entry, like so:

"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL","name": "namespaces/DCGM_FI_DEV_GPU_UTIL","name": "services/DCGM_FI_DEV_GPU_UTIL","name": "pods/DCGM_FI_DEV_GPU_UTIL",

If you don't see it, it probably means that the dcgm-exporter deployment you used is missing the ServiceAccount component, and without that services entry the HPA still won't work.
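
In that case, a minimal sketch of the missing pieces looks something like this; the names, namespace and labels are placeholders and need to match whatever your dcgm-exporter DaemonSet actually uses.

# Sketch only: a ServiceAccount plus a Service exposing the dcgm-exporter
# metrics port, so the metrics get associated with a services/ resource.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
  namespace: my-namespace            # placeholder
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: my-namespace            # placeholder
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter               # must match the DaemonSet's pod labels
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400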

Finally, I wrote my HPA something like this:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: X
  maxReplicas: Y
  ...
  metrics:
  - type: Object
    object:
      metricName: DCGM_FI_DEV_GPU_UTIL
      targetValue: 80
      target:
        kind: Service
        name: dcgm-exporter

and it all worked.

I hope this helps! I spent so long trying different methods shown on consultancy company blogs, Medium posts, etc. before discovering that the people who write these pieces have already made assumptions about your deployment which affect details you really need to know about (e.g. the namespacing issue).