Terraform created AWS ECS infra: health check keep failing


I more or less figured this out myself. I never got the container-level health check to pass, but I did manage to fix the health check failure on the Application Load Balancer (ALB).

Problem & Cause

It turned out to be the security group of the EC2 instance. I noticed this while following an AWS troubleshooting page for health check failures, which advises you to ssh into the instance and try a curl -v ... on the instance directly. The curl failed, and I found that my EC2 instance was using the default security group (sg). The default sg does allow all traffic, but it limits the source to itself, i.e. the default security group. This can be confusing, but as far as I can tell it means that traffic is only allowed from resources that are attached to the default security group as well. Either way, it blocks traffic from anything outside that group, so I couldn't reach the service via my domain name, and neither could the ALB health check agent.
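To make the self-referencing rule concrete, here is a rough Terraform sketch of an ingress rule equivalent to the one on the default security group. The resource name `default_like` and the `vpc_id` variable are made up for illustration; this is not the actual resource from my setup.

```hcl
# Hypothetical variable, only here so the sketch is self-contained.
variable "vpc_id" {
  type = string
}

# Behaves like the default security group's inbound rule:
# every protocol and port is allowed, but only when the
# source is attached to this same security group.
resource "aws_security_group" "default_like" {
  name   = "default-like-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1" # all protocols
    self      = true # source = this security group itself
  }
}
```

An ALB that sits in a different security group is not a valid source for this rule, so its health check requests never reach the instance.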

Solution

My final solution was to have a dedicated security group for the ALB, and then create a new security group for the EC2 instances that only allows traffic from the ALB's security group. Also note that since the ALB's security group already limits the ports to 80 and 443, and the EC2 instance sg now sits behind the ALB's sg (all internal traffic), there's no need to limit the ports to 80 / 443 in the EC2 instance sg. You can leave the port as 0 to allow all ports. If you limit it to the wrong port, the health check will start failing. See the following from the AWS troubleshooting page:

  1. Confirm that the security group associated with your container instance allows all ingress traffic on the ephemeral port range (typically ports 32768-65535) from the security group associated with your load balancer

Important: If you declare the host port in your task definition, the service will be exposed on the specified port rather than in the ephemeral port range. For this reason, be sure that your security group reflects the specified host port instead of the ephemeral port range.
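The two security groups described in the Solution could be sketched in Terraform roughly as follows. This is a minimal sketch, not my exact configuration: the resource names (`alb`, `ecs_instances`), the group names, and the `vpc_id` variable are assumptions, and it assumes the ALB is public and listens on 80 and 443.

```hcl
# Hypothetical variable, only here so the sketch is self-contained.
variable "vpc_id" {
  type = string
}

# Security group for the ALB: public HTTP/HTTPS in, anything out.
resource "aws_security_group" "alb" {
  name   = "alb-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Security group for the EC2 container instances: all ports,
# but only from the ALB's security group. This also covers the
# ephemeral port range used with dynamic host port mapping.
resource "aws_security_group" "ecs_instances" {
  name   = "ecs-instances-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 0
    to_port         = 0
    protocol        = "-1"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

If you would rather not open every port, the ingress rule on `ecs_instances` could be narrowed to TCP 32768-65535 (or to the fixed host port from the task definition), as the quoted AWS guidance suggests.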


Other Concerns

This took me a lot of time and effort to figure out. A small side note: I still can't get the container-level health check to work, i.e. the one defined in the task definition of AWS ECS. I tried to ssh into the container instance (EC2 instance), and it turns out that localhost does not work there. Even the AWS troubleshooting page uses an IP address obtained from docker inspect when testing with curl on the EC2 instance directly. But then, for the container health check in the task definition, if I'm not checking localhost, what should I check? Should I run docker inspect in the health check command as well to obtain the IP address first? This problem remains unsolved; for now I just use an exit 0 to bypass the health check. If anyone knows the correct way to configure this, feel free to share, because I really want to know as well.
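One thing that may be worth trying: the health check command in a task definition is executed inside the container, not on the EC2 host, so localhost in that command refers to the container itself and docker inspect should not be needed. A hedged sketch of what that could look like in Terraform is below. The family name, image tag, memory value, and timing values are placeholders, and it assumes the image ships curl and serves HTTP on port 80 at /. I have not verified this against my own setup.

```hcl
# Sketch only: names and values are illustrative assumptions.
resource "aws_ecs_task_definition" "nginx" {
  family = "nginx-example"

  container_definitions = jsonencode([
    {
      name      = "nginx"
      image     = "nginx:latest"
      memory    = 256
      essential = true

      # hostPort = 0 asks ECS for a dynamic (ephemeral) host port
      # when using bridge networking on EC2 container instances.
      portMappings = [
        { containerPort = 80, hostPort = 0 }
      ]

      # Runs inside the container, so localhost:80 is the
      # container's own nginx. Requires curl in the image.
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost/ || exit 1"]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 10
      }
    }
  ])
}
```

This would also explain why curl against localhost fails on the EC2 instance itself: with a dynamic host port, nothing is listening on port 80 of the host, only inside the container.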