
StatefulSet recreates pod, why?


You should look into two things:

  1. Debug Pods

Check the current state of the pod and recent events with the following command:

kubectl describe pods ${POD_NAME}

Look at the state of the containers in the pod. Are they all Running? Have there been recent restarts?

Continue debugging depending on the state of the pods.

In particular, take a closer look at why the Pod crashed.

More info can be found in the links I have provided.
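For example, two standard read-only checks (using ${POD_NAME} as above) that often point straight at a crash loop:

kubectl get pod ${POD_NAME} -o jsonpath='{.status.containerStatuses[*].restartCount}'

kubectl get events --field-selector involvedObject.name=${POD_NAME} --sort-by=.lastTimestamp

The first prints the restart count per container; the second lists the recent events recorded for that Pod.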

  2. Debug StatefulSets.

StatefulSets provide a debug mechanism to pause all controller operations on Pods using an annotation. Setting the pod.alpha.kubernetes.io/initialized annotation to "false" on any StatefulSet Pod pauses all operations of the StatefulSet: it will not perform any scaling operations, so you can execute commands within the containers of StatefulSet pods without interference. You can set the annotation to "false" by executing the following:

kubectl annotate pods <pod-name> pod.alpha.kubernetes.io/initialized="false" --overwrite

When the annotation is set to "false", the StatefulSet will not respond to its Pods becoming unhealthy or unavailable, and it will not create replacement Pods until the annotation is removed or set to "true" on each StatefulSet Pod.
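Once you're done debugging, the same command should let you resume normal operation by flipping the annotation back (or you can remove it entirely):

kubectl annotate pods <pod-name> pod.alpha.kubernetes.io/initialized="true" --overwrite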

Please let me know if that helped.


Another nifty little trick I came up with is to describe the pod as soon as it stops logging, by using

kubectl logs -f mypod && kubectl describe pod mypod

When the pod fails and stops logging, kubectl logs -f mypod terminates and the shell immediately executes kubectl describe pod mypod, hopefully letting you catch the state of the failing pod before it is recreated.

In my case it was showing

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

in line with what Timothy is saying.
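If the pod object is still around, you can also pull that last state directly instead of racing the restart (mypod as above; the first container is assumed):

kubectl get pod mypod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'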


kubectl logs -p postgresPod

The -p will give you the previous logs (if it's a simple restart).
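If the pod runs more than one container you'll need to name it, and --timestamps helps line the crash up with node events (the container name here is just a placeholder):

kubectl logs -p postgresPod -c postgres --timestamps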

There's a whole bunch of "know the rest of your environment" questions that beg to be asked here. Do you know how many nodes make up your cluster (are we talking one or two, or tens, hundreds or more)? Do you know whether they are dedicated instances, or are you on a cloud provider like AWS using spot instances?

Take a look at kubectl get nodes; it should give you the age of your nodes.
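The wide output adds internal/external IPs, OS image, kernel version and container runtime, which can hint at whether a node was recently replaced:

kubectl get nodes -o wide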

Do you have requests and limits set on your pod? Do a kubectl describe pod ${POD_NAME}. Among the requests, limits, etc. you'll see which node the pod is on. Describe the node; it will have CPU and memory details. Can your pod live within those? Is your app configured to live within those limits? If you don't have limits set, your pod could easily consume so many resources that the kernel OOM killer terminates it. If you do have pod limits but have misconfigured your app, then K8s may be killing your app because it is breaching those limits.
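If you need to add or adjust requests and limits without editing the manifest, something along these lines works (the StatefulSet name, container name and values are placeholders you'd adapt):

kubectl set resources statefulset my-statefulset -c=postgres --requests=cpu=250m,memory=256Mi --limits=cpu=500m,memory=512Mi

And to see how much of a node the scheduled pods have already claimed:

kubectl describe node <node-name> | grep -A 10 "Allocated resources"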

If you have access to the node, take a look at dmesg to see if the OOM-Killer has terminated any of your pods. If you don't have access, get someone who does to take a look at the logs. When you're describing the node, look for pods with 0 limits, as that is unlimited; they may be misbehaving and causing your app to be killed because it was unlucky enough to request more resources from the system when there were none available due to the unlimited apps.
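As a rough sketch, assuming you can get a shell on the node and its dmesg supports -T for readable timestamps:

dmesg -T | grep -iE 'killed process|out of memory'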