How to automatically have dead Docker Swarm Manager replaced to have at least x manager running?


You need to implement this with an external monitoring solution. It's not a built-in capability of Docker swarm mode.

Implementing this is non-trivial. First, keep in mind that promoting a node gives it full administrative access over the swarm, whereas a normal worker has none, so make sure your security model is OK with that change. Second, you need to avoid cascading failures, where overload causes one manager to fail and each automatically promoted replacement fails in turn as the existing workload is redistributed across fewer and fewer nodes, until there are no workers left to promote. Lastly, when you add a new manager, you need to decide what to do with the reference to the failed one. If it recovers, do you want it to continue where it left off, or do you want it removed from the swarm completely to reduce the number of nodes needed for quorum?
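To make those concerns concrete, here is a minimal sketch of what such an external monitor could look like in Python. It is not an official tool. The function names and the `min_workers_left` and `max_per_run` guards are my own assumptions, and it relies on the `ID`, `Status`, `Availability` and `ManagerStatus` fields that `docker node ls --format '{{json .}}'` prints. It also has to run on a manager while the swarm still has quorum, because otherwise `docker node ls` and `docker node promote` will fail.

```python
import json
import subprocess


def parse_nodes(raw):
    """Parse the line-delimited JSON printed by `docker node ls --format '{{json .}}'`."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def healthy_managers(nodes):
    """Managers that currently take part in the quorum."""
    return [n for n in nodes if n.get("ManagerStatus") in ("Leader", "Reachable")]


def promotion_candidates(nodes):
    """Workers that are up and accepting tasks (ManagerStatus is empty for workers)."""
    return [
        n for n in nodes
        if not n.get("ManagerStatus")
        and n.get("Status") == "Ready"
        and n.get("Availability") == "Active"
    ]


def plan_promotions(nodes, min_managers, min_workers_left=1, max_per_run=1):
    """Return the IDs of the workers to promote in this run.

    min_workers_left and max_per_run are hypothetical guards against the
    cascade described above: never use up the last workers, and promote
    only a few nodes per run so an overload cannot eat the whole cluster.
    """
    missing = max(0, min_managers - len(healthy_managers(nodes)))
    candidates = promotion_candidates(nodes)
    spare = max(0, len(candidates) - min_workers_left)
    count = min(missing, spare, max_per_run)
    return [n["ID"] for n in candidates[:count]]


def run_once(min_managers=3):
    """Query the swarm once and promote workers if too few managers are healthy."""
    raw = subprocess.run(
        ["docker", "node", "ls", "--format", "{{json .}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    for node_id in plan_promotions(parse_nodes(raw), min_managers):
        subprocess.run(["docker", "node", "promote", node_id], check=True)
```

You would call `run_once()` from cron or a systemd timer with a generous interval, so a manager that is only restarting is not replaced right away. The sketch deliberately does nothing about the failed manager. Whether to `docker node demote` and `docker node rm` it, or wait for it to come back, is the policy decision mentioned above.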

One last thing to note: when you lose quorum, nodes continue to run the containers they have already started. The only thing you lose is the ability to manage and make changes to that infrastructure. That is why most places I've seen run 3 or 5 managers, depending on the level of fault tolerance needed, and often make the managers virtual so that if a failure occurs, the VM image can easily be restarted elsewhere in their environment.
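The reason 3 or 5 are the usual numbers follows from the quorum arithmetic: swarm managers use Raft, which needs a strict majority of managers to be reachable. A small sketch of that calculation (the function names are just illustrative):

```python
def quorum_size(managers):
    """Smallest number of managers that forms a strict majority."""
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can be lost while a majority still remains."""
    return managers - quorum_size(managers)


for n in (1, 3, 4, 5, 7):
    print(f"{n} managers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 managers does not buy any extra tolerance, which is why odd counts are used. It is also why a dead manager that stays in the node list matters: it still counts toward the total the majority is calculated from.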