What do multi-process vs multi-threaded servers most benefit from?


Unicorn is process-based, which means that each instance of Ruby has to live in its own process. That can be in the area of 500 MB per process, which will quickly drain system resources. Puma, being thread-based, won't use the same amount of memory to (theoretically) attain the same amount of concurrency.

Unicorn, because it runs multiple processes, gets true parallelism between those processes. Parallelism is limited by your CPU cores (more cores can run more processes at the same time), but the kernel will switch between active processes, so you can run more than 4 or 8 processes (however many cores you have). The real limit is your machine's memory. Until recently, Ruby was not copy-on-write friendly, which meant that EVERY process carried its own full copy of the inherited memory (Unicorn is a preforking server). Ruby 2.0 is copy-on-write friendly, which could mean that Unicorn won't actually have to keep a separate copy of everything in each child process. I'm not 100% clear on this. Read about copy-on-write, and check out Jesse Storimer's awesome book 'Working with Unix Processes'. I'm pretty sure he covers it in there.

Puma is a threaded server. MRI, because of the global interpreter lock (GIL), can only run a single CPU-bound task at a time (cf. Ruby Tapas episode 127, 'Parallel Fib'). It will context-switch between threads, but for CPU-bound work (e.g. data processing) only one thread of execution ever runs at once. This gets interesting if you run your server on a different implementation of Ruby, like JRuby or Rubinius. They do not have the GIL, and can process a great deal of information in parallel. JRuby is pretty speedy, and while Rubinius is slow compared to MRI, a multithreaded Rubinius process can still chew through data faster than MRI. During blocking I/O, however (e.g. writing to a database, making a web request), MRI releases the GIL: it context-switches to another thread, does work there, and switches back to the previous thread when the data has been returned.

For Unicorn, I would say the bottleneck is memory and clock speed. For Puma, I would say the bottleneck is your choice of interpreter (MRI vs Rubinius or JRuby) and the kind of work your server does (lots of CPU-bound tasks vs blocking I/O).

There are tons of great resources on this debate. Check out Jesse Storimer's books on these topics, 'Working with Ruby Threads' and 'Working with Unix Processes'; read Ryan Tomayko's quick summary of preforking servers; and google around for more info.

I don't know what the best worker count is for Unicorn or Puma in your case. The best thing to do is run performance tests and do what is right for you; there is no one-size-fits-all. (Although I think Puma's standard is a pool of up to 16 threads, locked at that.)


Puma is actually both multithreaded and multiprocess. You can invoke it in "clustered mode", where it spawns multiple forked workers that run on different cores even on MRI. Since Puma is multithreaded, it's probably appropriate to run a number of processes equal to the number of cores on the server. So for a 4-core server, something like this would be appropriate:

puma -t 8:32 -w 4 --preload

This will handle up to 32 concurrent threads, with up to 4 of them executing on the CPUs in parallel, and should be able to maximize the CPU resources on the server. The --preload argument loads the app once, before forking, and takes advantage of the Ruby 2.0 copy-on-write improvements to the garbage collector to reduce RAM usage.
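The same flags can also live in a config/puma.rb file. A sketch using Puma's config DSL, with the numbers taken from the command line above:

```ruby
# config/puma.rb -- equivalent of `puma -t 8:32 -w 4 --preload`
workers 4        # one forked worker per CPU core
threads 8, 32    # min and max threads in each worker's pool
preload_app!     # boot the app once, before forking, for copy-on-write sharing
```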

If your app spends considerable time waiting on other services (search services, databases, etc.), then this is a large improvement. When a thread blocks, another thread in the same process can grab the CPU and do work. You can serve up to 32 requests concurrently in this example, while only taking the hit of running 4 processes in RAM.

With Unicorn, you would have to fork off 32 workers which would take the hit of running 32 processes in RAM, which is highly wasteful.

If all your app does is CPU crunching, then this will be highly inefficient: you should reduce the number of Unicorn workers, and the benefits of Puma over Unicorn shrink accordingly. But in the Unicorn case you have to benchmark your app and figure out the right number yourself, whereas Puma will tend to optimize itself by spawning more threads; its performance should range from no worse than Unicorn (in the pure-CPU case) to vastly better than Unicorn (for an app that sleeps a lot).

Of course, if you use Rubinius or JRuby then it's no contest: you can spawn one process that runs across all cores and handles all 32 threads.

TL;DR: I don't think there's much advantage to Unicorn over Puma, since Puma actually uses both models.

Of course, I don't know anything about the relative reliability of Puma vs Unicorn running production software in the real world. One thing to be concerned about: if you scribble over any global state in one thread, it can affect other requests executing at the same time, which may produce indeterminate results. Since Unicorn doesn't use threads, it has no such concurrency issues. I would hope that by this time both Puma and Rails are mature with respect to thread safety and that Puma is usable in production. However, I would not necessarily expect every Rails plugin and rubygem I found on GitHub to be threadsafe, and would expect to have to do some additional work.

But once you're successful enough to be finding threading problems in third-party libraries, you're probably large enough that you can't afford the RAM cost of running so many Unicorn processes. OTOH, I understand concurrency bugs and I'm good with Ruby, so that debugging cost may be much less for me than the cost of buying RAM in the cloud. YMMV.

Also note that I'm not sure whether you should count hyperthreaded cores or physical cores when estimating the value to pass to '-w'; you'd need to perf-test that yourself, along with what values to use for -t. That said, even if you run twice the number of processes you 'need', the kernel's process scheduler should handle it without trouble until you saturate the CPU, at which point you'll have larger issues anyway. I would probably recommend starting with one process per hyperthreaded core (on MRI).