How to use K8S node_problem_detector?


"This tool aims to make various node problems visible to the upstream layers in cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to apiserver."

Err, ok, but... what does that actually mean? How can I tell if it made it to the apiserver?
What do the before and after look like? Knowing that would help me understand what it's doing.

Before installing Node Problem Detector I see:

Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 20 Jun 2019 12:30:05 -0400   Thu, 20 Jun 2019 12:30:05 -0400   WeaveIsUp                    Weave pod has set this
  OutOfDisk            False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:30:14 -0400   KubeletReady                 kubelet is posting ready status

After installing Node Problem Detector I see:

Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
Bash# kubectl rollout status daemonset npd-node-problem-detector # (wait for it to come up)
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  DockerDaemon         False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   DockerDaemonHealthy          Docker daemon is healthy
  EBSHealth            False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   NoVolumeErrors               Volumes are attaching successfully
  KernelDeadlock       False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   KernelHasNoDeadlock          kernel has no deadlock
  ReadonlyFilesystem   False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   FilesystemIsNotReadOnly      Filesystem is not read-only
  NetworkUnavailable   False   Thu, 20 Jun 2019 12:30:05 -0400   Thu, 20 Jun 2019 12:30:05 -0400   WeaveIsUp                    Weave pod has set this
  OutOfDisk            False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:30:14 -0400   KubeletReady                 kubelet is posting ready status

Note I asked for help coming up with a way to see this for all nodes, Kenna Ofoegbu came up with this super useful and readable gem:

zsh# nodes=$(kubectl get nodes | sed '1d' | awk '{print $1}') && for node in $nodes; do; kubectl describe node $node | sed -n '/Conditions/,/Ready/p'; done
Bash# (the same command gives errors under Bash)
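
If you want a variant that also runs under plain Bash, a minimal sketch (it just drops the zsh-only semicolon after do and uses --no-headers instead of sed '1d') could look like this:

Bash# for node in $(kubectl get nodes --no-headers | awk '{print $1}'); do kubectl describe node "$node" | sed -n '/Conditions/,/Ready/p'; done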


Ok, so now I know what Node Problem Detector does, but... what good is adding a condition to the node? How do I use the condition to do something useful?

Question: How to use Kubernetes Node Problem Detector?
Use Case #1: Auto heal borked nodes
Step 1.) Install Node Problem Detector so it can attach new condition metadata to nodes.
Step 2.) Leverage Planetlabs/draino to cordon and drain nodes with bad conditions.
Step 3.) Leverage https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler to auto heal. (When the node is cordoned and drained it gets marked unschedulable, which triggers a new node to be provisioned; the bad node's resource utilization then drops so low that the bad node gets deprovisioned.) A rough sketch of how these pieces wire together is shown after the source link below.

Source: https://github.com/kubernetes/node-problem-detector#remedy-systems
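
As a rough sketch of how these three pieces wire together (the Draino manifest name and the cluster-autoscaler values file here are assumptions, not verbatim from either project; check each project's docs for your cloud):

# Step 1: NPD attaches conditions like KernelDeadlock / ReadonlyFilesystem to nodes (same install as above)
Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml

# Step 2: Draino cordons and drains nodes whose conditions go bad
#         (manifest name is an assumption, see the planetlabs/draino repo for the real one)
Bash# git clone https://github.com/planetlabs/draino.git && kubectl apply -f draino/manifest.yml

# Step 3: Cluster Autoscaler provisions a replacement node for the displaced pods and
#         scales away the drained, now-underutilized node
Bash# helm upgrade --install cluster-autoscaler stable/cluster-autoscaler -f cluster-autoscaler.values.yaml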


Use Case #2: Surface the unhealthy node event so that it can be detected by Kubernetes and then ingested into your monitoring stack, giving you an auditable historic record that the event occurred and when.
These unhealthy node events are logged somewhere on the host node, but the host node usually generates so much noisy/useless log data that these events aren't collected by default.
Node Problem Detector knows where to look for these events on the host node and filters out the noise; when it sees the signal of a negative outcome, it posts it to its pod log, which isn't noisy.
The pod log is likely getting ingested into an ELK and Prometheus Operator stack, where it can be detected, alerted on, stored, and graphed.
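
To eyeball that low-noise signal yourself before wiring up the monitoring stack, something like the following works (the app=node-problem-detector label is an assumption based on common chart defaults; check your release's labels):

Bash# kubectl logs -l app=node-problem-detector   # the filtered NPD pod logs your log shipper would collect
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i events -A 20   # NPD-generated node Events also show up here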

Also, note that nothing is stopping you from implementing both use cases.


Update: added a snippet of the node-problem-detector.helm-values.yaml file, per a request in the comments:

log_monitors:
  # https://github.com/kubernetes/node-problem-detector/tree/master/config contains the full list; you can also exec into the pod and ls /config/ to see these.
  - /config/abrt-adaptor.json    # Adds ABRT Node Events (ABRT: automatic bug reporting tool); exceptions will show up under "kubectl describe node $NODENAME | grep Events -A 20"
  - /config/kernel-monitor.json  # Adds 2 new Node Health Condition Checks: "KernelDeadlock" and "ReadonlyFilesystem"
  - /config/docker-monitor.json  # Adds new Node Health Condition Check "DockerDaemon" (checks if Docker is unhealthy as a result of a corrupt image)
# - /config/docker-monitor-filelog.json  # Error: "/var/log/docker.log: no such file or directory"; it doesn't exist on the pod. I think you'd have to mount a node hostPath to get it to work; the gain doesn't sound worth the effort.
# - /config/kernel-monitor-filelog.json  # Should add to the existing Node Health Check "KernelDeadlock" with more thorough detection, but silently fails in the NPD pod logs for me.
custom_plugin_monitors: # []
  # Someone said all *-counter plugins are custom plugins; if you put them under log_monitors, you'll get Error: "Failed to unmarshal configuration file "/config/kernel-monitor-counter.json""
  - /config/kernel-monitor-counter.json   # Adds new Node Health Condition Check "FrequentUnregisteredNetDevice"
  - /config/docker-monitor-counter.json   # Adds new Node Health Condition Check "CorruptDockerOverlay2"
  - /config/systemd-monitor-counter.json  # Adds 3 new Node Health Condition Checks: "FrequentKubeletRestart", "FrequentDockerRestart", and "FrequentContainerdRestart"
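
After pushing a values file like the one above, you can confirm which configs the pod actually loaded and that the new conditions appear (the release name and label match the Helm install shown earlier and are assumptions about your setup):

Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.helm-values.yaml
Bash# kubectl exec $(kubectl get pods -l app=node-problem-detector -o name | head -1) -- ls /config/
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20   # the new conditions should now be listed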


Considering node-problem-detector is a Kubernetes addon, you would need to install that addon on your own Kubernetes cluster.

A Kubernetes cluster has an addon-manager that will use it.


Do you mean: how to install it? The deployable manifest lives under deployment/ in the kubernetes/node-problem-detector repo:

kubectl create -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml