Node status changes to unknown on a high resource requirement pod
As per the docs, the node's 'Ready' condition is:
True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
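To read that condition directly instead of scanning the full node description, here is a minimal sketch. It assumes kubectl is already configured for the cluster; the function name node_ready_status is made up for illustration.

```shell
# Print the status of the Ready condition (True, False, or Unknown) for one node.
# node_ready_status is an illustrative helper name, not part of kubectl.
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
```

Call it as node_ready_status followed by the node name. Running it in a loop while the heavy pod starts should show roughly when the status flips to Unknown.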
It would seem that when you run your workloads, your kube-apiserver doesn't hear from your node (kubelet) for 40 seconds, so the node controller marks the node as 'Unknown'. There could be multiple reasons. Some things that you can try:
To see the 'Events' for your node, run:
$ kubectl describe node <node-name>
Check whether anything unusual shows up in your kube-apiserver logs. On your active master, run:
$ docker logs <container-id-of-kube-apiserver>
Check whether anything unusual shows up in your kube-controller-manager logs at the time your node goes into the 'Unknown' state. On your active master, run:
$ docker logs <container-id-of-kube-controller-manager>
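The two log commands above need a container ID. The sketch below looks it up first. It assumes Docker is the container runtime and that the kubelet named the containers with a k8s_ prefix followed by the component name; the helper name control_plane_logs is hypothetical.

```shell
# Show recent logs for a control plane component running as a Docker container.
# Assumption: container names contain k8s_<component>, as the kubelet's
# Docker integration names them; adjust the filter if your names differ.
control_plane_logs() {
  local component container_id
  component="$1"
  container_id="$(docker ps --filter "name=k8s_${component}" --format '{{.ID}}' | head -n 1)"
  if [ -z "$container_id" ]; then
    echo "no running container found for ${component}" >&2
    return 1
  fi
  # docker logs prints the container's stdout and stderr; --tail limits the output
  docker logs --tail 200 "$container_id"
}
```

For example, control_plane_logs kube-apiserver or control_plane_logs kube-controller-manager. On masters that use containerd rather than Docker, crictl ps and crictl logs play the same role.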
Increase the --node-monitor-grace-period option in your kube-controller-manager. You can add it to the command line in /etc/kubernetes/manifests/kube-controller-manager.yaml and restart the kube-controller-manager container.
When the node is in the 'Unknown' state, can you ssh into it and see if you can reach the kube-apiserver? Try both the <master-ip>:6443 endpoint and the kubernetes.default.svc.cluster.local:443 endpoint.
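The reachability check can be sketched with curl, run from the affected node. This is only a probe under stated assumptions: the helper name check_apiserver is hypothetical, and it assumes the API server exposes /healthz to unauthenticated clients, which depends on how the cluster is configured.

```shell
# Probe the kube-apiserver health endpoint from the affected node.
# $1 is host:port, for example the master IP followed by :6443.
check_apiserver() {
  # -k: skip TLS verification (probe only); -s: quiet; --max-time: give up after 5 s
  curl -k -s --max-time 5 "https://$1/healthz"
}
```

Call it once with <master-ip>:6443 and once with kubernetes.default.svc.cluster.local:443 as the argument. A reachable, healthy API server usually answers ok. An authorization error still shows that the network path works, whereas a timeout points to a connectivity problem. Note that the service name normally resolves only through the cluster DNS, so from the node itself it may fail to resolve even on a healthy cluster.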
Considering that the node was previously working and only recently stopped showing the 'Ready' status, try restarting the kubelet service. Just ssh into the affected node and execute:
$ /etc/init.d/kubelet restart
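Many nodes manage the kubelet with systemd, in which case the init script may not exist. The sketch below tries systemd first and falls back to the init script. It assumes the service is named kubelet and that you have sudo rights; restart_kubelet is an illustrative name.

```shell
# Restart the kubelet, preferring systemd when it is available.
restart_kubelet() {
  if command -v systemctl >/dev/null 2>&1; then
    sudo systemctl restart kubelet
  else
    sudo /etc/init.d/kubelet restart
  fi
}
```

On systemd hosts, journalctl -u kubelet shows the kubelet's own log, which may explain why it stopped reporting in the first place.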
Back on your master node, run kubectl get nodes to check whether the node is 'Ready' again.
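Instead of re-running the check by hand, kubectl wait can block until the node reports Ready. A minimal sketch; the helper name wait_for_node_ready and the 120s default timeout are arbitrary choices for illustration.

```shell
# Block until the given node reports Ready, or fail when the timeout expires.
# $1 is the node name; $2 is an optional timeout (defaults to 120s).
wait_for_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-120s}"
}
```

If the call times out, the node is still not reporting in, and the log and connectivity checks are the next place to look.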