What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster? What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster? kubernetes kubernetes

What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?


Short Answer:

Azure AKS automatically deploys an Azure Load Balancer (with public IP address) when you add a new ingress (nGinx / Traefik... ANY Ingress) — that Load Balancer has its settings configured as a 'Basic' Azure LB which has a 4 minute idle connection timeout.

That idle timeout is both standard AND required (although you MAY be able to modify it, see here: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout). That being said there is no way to ELIMINATE it entirely for any traffic that is heading externally OUT from the Load Balancer IP — the longest duration currently supported is 30 minutes.

There is no native Azure way to get around an idle connection being cut.

So as per the original question, the best way (I feel) to handle this is to leave the timeout at 4 minutes (since it has to exist anyway) and then setup your infrastructure to disconnect your connections in a graceful way (when idle) prior to hitting the Load Balancer timeout.

Our Solutions

For our Ghost Blog (which hit a MySQL database) I was able to roll back as mentioned above which made the Ghost process able to handle a DB disconnect / reconnect scenario.

What about Rails?

Yep. Same problem.

For a separate Rails based app we also run on AKS which is connecting to a remote Postgres DB (not on Azure) we ended up implementing PGbouncer (https://github.com/pgbouncer/pgbouncer) as an additional container on our Cluster via the awesome directions found here: https://github.com/edoburu/docker-pgbouncer/tree/master/examples/kubernetes/singleuser

Generally anyone attempting to access a remote database FROM AKS is probably going to need to implement an intermediary connection pooling solution. The pooling service sits in the middle (PGbouncer for us) and keeps track of how long a connection has been idle so that your worker processes don't need to care about that.

If you start to approach the Load Balancer timeout the connection pooling service will throw out the old connection and make a new fresh one (resetting the timer). That way when your client sends data down the pipe it lands on your Database server as anticipated.

In closing

This was an INSANELY frustrating bug / case to track down. We burned at least 2 dev-ops days figuring the first solution out but even KNOWING that it was probably the same issue we burned another 2 days this time around.

Even elongating the timer beyond the 4 minute default wouldn't really help since that would just make the problem more ephemeral to troubleshoot. I guess I just hope that anyone who has trouble connecting from Azure AKS / Kubernetes to a remote db is better at googling than I am and can save themselves some pain.

Thanks to MSFT Support (Kris you are the best) for the hint on the LB timer and to the dude who put together PGbouncer in a container so I didn't have to reinvent the wheel.