Flink HA JobManager cluster cannot elect a leader


According to the logs, it looks as if the TaskManager cannot connect to the new leader, and I assume the same is true for the web UI. The logs show it trying to connect to flink-job-manager-0.flink-job-svc.flink.svc.cluster.local/10.244.3.166:44013. I cannot tell from the logs whether flink-job-manager-1 is bound to this IP, but my suspicion is that the headless service returns multiple IPs and Flink picks the wrong or stale one. Could you log into the flink-job-manager-1 pod and check what its IP address is?
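To test the multiple-IP suspicion, something like the following Python sketch could be run from inside a pod in the cluster. It lists every address DNS returns for a hostname and reports whether the IP seen in the logs is among them. The helper names `resolve_all` and `check_address` are made up for this illustration; only `socket.getaddrinfo` is a real library call.

```python
import socket


def resolve_all(hostname, port=0):
    """Return the sorted, de-duplicated IP addresses DNS gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_address(hostname, expected_ip, resolver=resolve_all):
    """Compare the resolved addresses with the IP reported in the logs.

    `resolver` is injectable so the check can be exercised without real DNS.
    """
    ips = resolver(hostname)
    return {
        "resolved": ips,
        "matches": expected_ip in ips,   # is the logged IP one of the answers?
        "ambiguous": len(ips) > 1,       # more than one answer for one name?
    }
```

For example, `check_address("flink-job-manager-0.flink-job-svc.flink.svc.cluster.local", "10.244.3.166")` shows whether that name still resolves to the logged address, and running the same check against the bare service name `flink-job-svc.flink.svc.cluster.local` shows how many pod IPs the headless service hands out. Comparing the result with the IP reported inside the flink-job-manager-1 pod (for instance via `hostname -i`) should tell you whether a stale address is being used.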

I think you should be able to resolve this problem by defining a dedicated service for each JobManager, or by using the pod hostname instead.
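As a rough sketch of the dedicated-service option, the Python snippet below generates one Service manifest per JobManager pod as JSON, which `kubectl apply -f -` accepts. It assumes the JobManagers run as a StatefulSet (the pod names suggest this, but the question does not confirm it), so that each pod carries the `statefulset.kubernetes.io/pod-name` label. The function name `per_pod_service` and the port numbers are assumptions for illustration; the ports must match whatever your Flink configuration actually binds, including a fixed HA JobManager port rather than a random one.

```python
import json


def per_pod_service(pod_name, namespace="flink", ports=None):
    """Build a Service manifest that selects exactly one StatefulSet pod."""
    # Assumed ports; replace with the ones set in your flink-conf.yaml.
    ports = ports or {"rpc": 6123, "blob": 6124, "ui": 8081}
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": pod_name, "namespace": namespace},
        "spec": {
            # Label added by the StatefulSet controller to each of its pods.
            "selector": {"statefulset.kubernetes.io/pod-name": pod_name},
            "ports": [
                {"name": name, "port": port, "targetPort": port}
                for name, port in ports.items()
            ],
        },
    }


if __name__ == "__main__":
    # One Service per JobManager replica (two replicas assumed here).
    for index in range(2):
        print(json.dumps(per_pod_service(f"flink-job-manager-{index}"), indent=2))
```

With services like these, each JobManager would advertise its own stable name (for example flink-job-manager-1.flink.svc.cluster.local) instead of a name served by the shared headless service, so a lookup can only ever return that one pod.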