
docker swarm not restarting unhealthy selenium hub containers


First, I'd leave the restart_policy out. Swarm mode will recover a failed container for you; that policy is handled outside of swarm mode and can result in unexpected behavior. Next, since you have configured the healthcheck with multiple retries, a timeout, and a start period, the way to debug it is to inspect the container. E.g. you can run the following:

docker container inspect $container_id --format '{{json .State.Health}}' | jq .
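The output will look something like this (values are illustrative, and the Output field below assumes a hypothetical curl-based healthcheck):

{
  "Status": "unhealthy",
  "FailingStreak": 4,
  "Log": [
    {
      "Start": "2020-05-28T10:15:04.000000000Z",
      "End": "2020-05-28T10:15:09.000000000Z",
      "ExitCode": 1,
      "Output": "curl: (7) Failed to connect to localhost port 4444: Connection refused"
    }
  ]
}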

That shows the current status of the container, including a log of healthcheck results over time. If it shows the container failing for more than 3 consecutive checks and marked unhealthy, then check the service state:

docker service inspect $service_name --format '{{json .UpdateStatus}}' | jq .

That should show whether an update is currently in progress and whether the rollout of a change has resulted in any issues.
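If a rollout has been paused by failing tasks, the UpdateStatus object will look something like this (illustrative values; the exact Message text varies by docker version):

{
  "State": "paused",
  "StartedAt": "2020-05-28T09:55:00.000000000Z",
  "Message": "update paused due to failure or early termination of task ..."
}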

One other thing to look at is the memory limit. Without a corresponding memory reservation, the scheduler may be using the limit as a reservation (I'd need to test this), and if you don't have 10G of memory available that hasn't already been reserved by other containers, the scheduler may fail to reschedule the service. The easy fix is to specify a smaller reservation that you want to ensure is always available on the node when scheduling the container. E.g.:

   deploy:
     resources:
       limits:
         memory: 5000M
       reservations:
         memory: 1000M
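To sanity-check what the scheduler has to work with, you can inspect the resources each node advertises (MemoryBytes is the node's total schedulable memory, not its free memory):

docker node inspect $node_name --format '{{json .Description.Resources}}' | jq .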

Based on the latest debugging output:

docker container inspect 1abfa546cc26 --format '{{json .State.Health}}' | jq .

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fa114765fff m=0 sigcode=18446744073709551610

This suggests the host itself, or the docker engine, is causing the issues, and not your container's configuration. If you haven't already, I'd ensure that you are running the most recent stable release of docker; at last check, that's 19.03.9. I'd check other OS logs in /var/log/ for any other errors on the host. I'd check for resource limits being reached, things like memory, and any process/thread related sysctl settings (e.g. kernel.pid_max). With docker I also recommend keeping your kernel and systemd versions updated, and rebooting after an update to those for the changes to apply.
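As a rough sketch of those host-level checks (paths and log names vary by distro, e.g. /var/log/syslog on Debian/Ubuntu vs. journalctl elsewhere):

# engine version on the affected node
docker version --format '{{.Server.Version}}'

# thread count on the host vs. the kernel ceilings
ps -eLf | wc -l
sysctl kernel.pid_max kernel.threads-max

# free memory and any OOM/pthread errors in the system logs
free -h
sudo grep -iE 'oom|pthread' /var/log/syslog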

I'd also recommend reviewing this unix.se post on the same error, which has a few other things to try.

If none of those help, you can contribute details to reproduce your scenario to similar open issues at:


For the 1st question: Swarm is expected to restart the unhealthy container after the configured number of retries has failed. If you want to dig into this, monitor the docker events with the following command:

docker events --filter event=health_status
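You can narrow that to the hub and add a time window; each matching event reports the new health state on the line. Something like the following (the sample event line is illustrative, reusing the container id from above):

docker events --since 1h --filter event=health_status --filter container=1abfa546cc26

2020-05-28T10:15:09.000000000Z container health_status: unhealthy 1abfa546cc26... (image=selenium/hub, name=selenium-hub)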

For the second question: whenever the hub restarts, all the nodes get restarted, and that's expected, because the hub holds a session with all the nodes; when you restart the hub, it resets all the sessions and sets up the nodes again.