
100 docker containers vs 100 small machines


The only "appropriate" answer to your question is: you have to test both options and find out which one is better. The reason for this is: you are running a very specific application, with a very specific workload and very specific requirements. Any recommendation without actual testing is a guess. Maybe an "educated guess", but not more than that.


That said, let me show you what I would consider when doing my analysis for such a scenario.

  • The docker overhead should be absolutely minimal. The tool "docker" itself adds essentially nothing at runtime -- it just uses regular Linux kernel features (namespaces, cgroups) to create an isolated environment for your application.

  • After the OS has booted up, it will consume some memory, true. But the CPU consumption by the OS itself should be negligible (even for very small instances). Since you mentioned that you don't have any problems with memory, it seems like we can assume that this "OS overhead" that your colleague mentioned would also probably be negligible.

  • If you consider the route "a lot of very small instances", you could also consider using the recently released t2.nano instance type. You need to test if it has enough resources to actually run your application.

  • If you consider the route "one single very large instance", you should also consider the c4.8xl instance. This should give you significantly more CPU power than the c3.8xl.

  • Cost analysis (prices in us-east-1):

    • 1x c3.8xlarge: 1.68 $/h
    • 1x c4.8xlarge: 1.675 $/h (roughly the same as the c3.8xl)
    • 100x t2.micro: 1.30 $/h (that is, 100x 0.013 $/h = 1.30 $/h)
    • 100x t2.nano: 0.65 $/h (that is, 100x 0.0065 $/h = 0.65 $/h) (winner)
  • Now let's analyze the amount of resources that you have on each setup. I'm focusing only on CPU and ignoring memory, since you mentioned that your application isn't memory hungry:

    • 1x c3.8xlarge: 32 vCPU
    • 1x c4.8xlarge: 36 vCPU (each one typically with better performance than the c3.8xl; the chip for c4 was specifically designed by Intel for EC2) (winner)
    • 100x t2.micro: 100x 10% of a vCPU ~= "10 vCPU" aggregate
    • 100x t2.nano: 100x 5% of a vCPU ~= "5 vCPU" aggregate
  • Finally, let's analyze the cost per resource

    • 1x c3.8xlarge: 1.68$/h / 32 vCPU = 0.0525 $ / (vCPU-h)
    • 1x c4.8xlarge: 0.0465 $ / (vCPU-h) (winner)
    • 100x t2.micro: 0.13 $ / (vCPU-h)
    • 100x t2.nano: 0.13 $ / (vCPU-h)

As you can see, the larger instances provide much higher compute density, and that compute ends up being significantly cheaper per vCPU-hour (the sketch below redoes the arithmetic).
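
If you want to redo this arithmetic with your own region's prices, here is a minimal shell sketch; the prices are the (assumed) us-east-1 On Demand prices quoted above:

```bash
#!/bin/bash
# Cost per vCPU-hour from the prices listed above -- swap in your region's prices.
cost_per_vcpu() { printf '%-14s %s $/vCPU-h\n' "$1" "$(echo "scale=4; $2 / $3" | bc)"; }
cost_per_vcpu "c3.8xlarge"   1.68  32   # -> 0.0525
cost_per_vcpu "c4.8xlarge"   1.675 36   # -> 0.0465
cost_per_vcpu "100xt2.micro" 1.30  10   # -> 0.1300 (baseline CPU only)
cost_per_vcpu "100xt2.nano"  0.65  5    # -> 0.1300 (baseline CPU only)
```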

You should also consider the fact that the T2 instances are "burstable": they can go beyond their baseline performance (10% and 5% as above) for some time, depending on how many "CPU credits" they have. However, you should know that, although they start with some credit balance, it's typically enough for booting up the OS and not much more than that (you'd accrue more CPU credits over time if you don't push your CPU beyond your baseline, but it seems that won't be the case here, since we are optimizing for performance...). And as we have seen, the "cost per resource" is nearly 3x that of the 8xl instances, so this short burst probably wouldn't change the rough estimate.
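
To get a feel for how long that burst lasts, here is a rough sketch. The figures (about 30 initial credits for a t2.micro, 6 credits earned per hour, 1 credit = 1 vCPU-minute at 100%) are assumptions taken from the T2 documentation of the time -- check the current numbers before relying on them:

```bash
#!/bin/bash
# How long can a t2.micro run at 100% CPU before falling back to its 10% baseline?
# Assumed figures: ~30 initial credits, 6 credits earned per hour,
# 1 credit = 1 vCPU-minute at full speed.
initial_credits=30
earn_per_min=$(echo "scale=4; 6 / 60" | bc)       # 0.1 credit/minute earned
burn_per_min=1                                    # 1 credit/minute burned at 100% CPU
net_burn=$(echo "scale=4; $burn_per_min - $earn_per_min" | bc)
echo "Full-speed minutes before hitting baseline: $(echo "scale=1; $initial_credits / $net_burn" | bc)"
# -> roughly 33 minutes, after which the instance is pinned at its 10% baseline
```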

You might also want to consider network utilization. Is the application network intensive? Either in latency requirements, bandwidth requirements, or the number of packets per second?

  • The smaller instances have less network performance available, while the larger instances have much more. However, the network of each smaller instance is used by a single application, whereas the powerful network of the larger instance is shared among all the containers. The 8xl instances come with a 10 Gbps network card; the t2 instances have very low network performance compared to that.
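
If network matters, measure it rather than relying on the published labels. A minimal sketch with iperf3 (the IP address is a placeholder):

```bash
# On one instance (the "server" side):
iperf3 -s

# On another instance, point at the first one's private IP:
iperf3 -c 10.0.0.12 -P 8 -t 30    # 8 parallel streams for 30 seconds
```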

Now, what about resiliency? How time-sensitive are these jobs? What would be the "cost of not finishing them in a timely manner"? You might also want to consider some failure modes:

  • What happens if "one single instance dies"? In the case of 1x c3.8xl or 1x c4.8xl, your whole fleet would be down, and your workers would stop. Would they be able to "recover"? Would they need to "restart" their work? In the case of "a lot of small instances", well, the effect of "one single instance dying" might be much less impactful.

To reduce the effect of "one single instance dying" on your workload, and still get some of the benefits of higher density compute (i.e., large c3 or c4 instances), you could consider other options, such as 2x c4.4xl, 4x c4.2xl, and so on. Notice that the c4.8xl costs twice as much as the c4.4xl but has more than twice the number of vCPUs, so the analysis above isn't "linear"; you'd need to recalculate some costs (see the sketch below).
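
As a quick illustration of that non-linearity, a sketch using the (assumed) us-east-1 list prices of the smaller c4 sizes at the time:

```bash
#!/bin/bash
# Cost per vCPU-hour for a few c4 sizes (prices are assumptions -- verify them).
cost_per_vcpu() { printf '%-12s %s $/vCPU-h\n' "$1" "$(echo "scale=4; $2 / $3" | bc)"; }
cost_per_vcpu "c4.2xlarge" 0.419  8    # -> 0.0523
cost_per_vcpu "c4.4xlarge" 0.838  16   # -> 0.0523
cost_per_vcpu "c4.8xlarge" 1.675  36   # -> 0.0465 (the 8xl is the outlier)
```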

Assuming that you are OK with instances "failing" and your application can somehow deal with that, another interesting point to consider is Spot instances. With Spot instances, you name your price: if the "market price" (driven by supply and demand) is below your bid, you get the instance and pay only the "market price"; if the price fluctuates above your bid, your instances are terminated. It's not uncommon to see discounts of up to 90% compared to On Demand. As of right now, the c3.8xl is approximately 0.28 $/h in one AZ (83% less than On Demand) and the c4.8xl is about the same in one AZ (83% less as well). Spot pricing is not available for t2 instances.
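
Spot prices move constantly, so check the current history in your own region before deciding. A sketch with the AWS CLI (the 0.28 $/h figure is just the example value quoted above):

```bash
# Recent Spot price history for c4.8xlarge (Linux) -- prices vary per AZ and over time
aws ec2 describe-spot-price-history \
    --instance-types c4.8xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
    --output table

# Discount versus the ~1.675 $/h On Demand price, assuming a 0.28 $/h Spot price
echo "scale=3; (1 - 0.28 / 1.675) * 100" | bc    # -> roughly 83 (percent)
```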

You can also consider Spot Block, in which you specify the number of hours you want to run your instances for; you'll typically pay 30% - 45% less than On Demand, and there's no risk of being outbid during the period you specified. After that period, your instances are terminated.

Finally, I would try to size the fleet of servers so that they are needed for close to a "full number of hours" without exceeding it (unless I'm required to finish execution ASAP). That is, it's much better to have a smaller fleet that will finish the jobs in 50 minutes than a larger one capable of finishing them in 10 minutes, because you pay by the hour, at the beginning of the hour. Likewise, it's typically much better to have a larger fleet that will finish the job in 50 minutes than a smaller one that would need 1 hour and 5 minutes -- again, because you pay by the hour, at the beginning of the hour. (The sketch below shows the difference.)
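
A tiny worked example of why that last minute matters, using a hypothetical fleet of 10 instances at 0.838 $/h each:

```bash
#!/bin/bash
# With hourly billing you pay for ceil(runtime in hours) on every instance.
billed_cost() {   # args: runtime_minutes  instance_count  price_per_hour
    local hours=$(( ($1 + 59) / 60 ))
    echo "scale=2; $hours * $2 * $3" | bc
}
echo "Job finishes in 50 min: \$$(billed_cost 50 10 0.838)"   # 1 billed hour
echo "Job finishes in 65 min: \$$(billed_cost 65 10 0.838)"   # 2 billed hours -- twice the cost
```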


Lastly, you mention that you are looking for "the best performance". What exactly do you mean by that? What is your key performance indicator? Do you want to minimize the total time spent? Maybe reduce the time "per unit"/"per job"? Are you trying to reduce your costs? Are you trying to be more "energy efficient" and reduce your carbon footprint? Or maybe optimize for the amount of maintenance required? Or focus on simplicity, to reduce the ramp-up period that less-experienced colleagues would need before they could maintain the solution?

Maybe a combination of many of the performance indicators above? How would they combine? There's always a trade off...

As you can see, there isn't a clear winner -- at least not without more specific information about your application and your optimization goals. That's why, most of the time, the best option for any kind of performance optimization is really to test it. Testing is also typically inexpensive: maybe a couple of hours of work to set up your environment, and then likely less than $2/h while the tests run.

So, that should give you enough information to start your investigation.

Happy testing!


Based on EC2 instance CPU alone (rough arithmetic after this list):

  • 100 t2.micro's have less than 1/3 of the CPU power of the c4.8xl
  • 100 t2.small's have less than 2/3
  • 50 t2.large's might be getting close.
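
Here's the back-of-envelope arithmetic behind those ratios, assuming the published T2 baselines of the time (t2.micro 10%, t2.small 20%, t2.large 60% across its 2 vCPUs):

```bash
#!/bin/bash
# Aggregate baseline vCPU-equivalents versus the 36 vCPUs of a c4.8xlarge
echo "100x t2.micro: $(echo "100 * 0.10" | bc) vCPU-equivalent  (vs 36, i.e. under 1/3)"
echo "100x t2.small: $(echo "100 * 0.20" | bc) vCPU-equivalent  (vs 36, i.e. under 2/3)"
echo "50x  t2.large: $(echo "50  * 0.60" | bc) vCPU-equivalent  (vs 36, getting close)"
```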

It's fairly likely the c4.8xl would be a lot quicker, but there's no way anyone can authoritatively say that. As Bruno starts with, you need to run your workload through your app on both instance types to see; there are thousands of variables that could affect the results. There's no immediate reason you can't run 100 containers/processes on a Linux host -- it has been doing multi-core, multi-process work for a long time. There may be some simple system limits that need to be tuned as you go (see ulimit -a and sysctl -a, and the sketch below). On the other hand, there might be something in your application that performs really badly in this setup.
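
A few of the limits that commonly need attention when packing ~100 containers/processes onto one host (the values below are illustrative, not recommendations):

```bash
# Inspect the current limits first
ulimit -a                                 # per-shell limits (open files, max processes, ...)
sysctl fs.file-max kernel.pid_max         # system-wide file handle and PID limits

# Raise them if your testing shows you're hitting a wall
ulimit -n 65536                           # more open files for this shell/session
sudo sysctl -w fs.file-max=2097152        # system-wide open file limit
sudo sysctl -w net.core.somaxconn=1024    # larger listen backlog, if network heavy
```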

Any Docker overhead pretty much cancels out, since you run a container in both your single-instance and multi-instance setups. Anything you can improve in this area is going to help limit the issues you run into on the shared host. IBM published a great report, "An Updated Performance Comparison of Virtual Machines and Linux Containers", detailing Docker overheads.

  • Container overhead will be negligible for CPU/Memory.
  • NAT networking adds some overhead, so use --net=host.
  • The AUFS disk will incur overhead. For anything write-intensive, mount your host's EBS/SSD directly into the container (see the sketch after this list).
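
A minimal sketch of those two mitigations (the image name and paths are hypothetical):

```bash
# --net=host skips the NAT/bridge layer; the -v mount sends writes to the
# host's EBS/SSD volume instead of the AUFS layer.
docker run -d --net=host -v /mnt/ebs/scratch:/scratch my-worker-image
```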

One big difference with many Docker containers is that the workload on the VM's process scheduler will be completely different. It's not necessarily going to be any worse when running more containers, but you are heading into "less tested" territory. You rely less on the Xen hypervisor running on the hardware and more on the Linux kernel running inside the VM for scheduling, and the two use completely different algorithms. Again, Linux is capable of running many processes and dealing with contention, but you're only going to find the answer for your app by testing it.

So, at a complete guess, not taking your application into account in any way and judging purely on general machine specs and a CPU-heavy workload:

  • The c4.8xl should be quicker than a t2.nano and t2.micro setup.
  • t2.smalls might get close.
  • 50 x t2.larges with 2 containers each might be on par.
  • 100 x m3.mediums would blitz it.
  • 3 x c4.8xl's would give you a dedicated hyper-thread of the quickest processor for each process/thread of your app.

tl;dr Test them all. Use the quickest.


In the general case, I would use Docker, simply because it is easier and faster to add/change/remove nodes and do load balancing. I am assuming here that you do not need exactly 100 nodes, but rather a changing number of them.

It is also much faster to set up. There is a thing called Docker Swarm (https://docs.docker.com/swarm/overview/) and I'm sure it's worth looking into. Additionally, if you can't put 100 or 1000 or 10000 containers on one physical machine (or even one VM), you can also distribute them with an overlay network (https://docs.docker.com/engine/userguide/networking/dockernetworks/#an-overlay-network).

EDIT: after re-reading the question, it seems you want more of a performance-wise answer. That's really hard to say (if not impossible) without testing; with performance there are always caveats and things you can't think of (simply on account of being human :) ). Again, setting up 100 or 3 or 999999 machines takes much more time than setting up the same number of Docker containers. I know that container/image terminology still gets mixed up, so just to clarify: you would create one Docker image, which is some work, and then run N instances of it (which are the containers), as in the sketch below. Someone please correct me on the terminology if I am wrong; this is something I often discuss with colleagues :)
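
A minimal sketch of that workflow (the image name, the count, and the Dockerfile are hypothetical):

```bash
# Build the image once...
docker build -t worker .

# ...then run N containers from it
for i in $(seq 1 100); do
    docker run -d --name "worker-$i" worker
done

docker ps --filter "name=worker-" --quiet | wc -l    # count the running worker containers
```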