Why do Hadoop jobs need so many threads?


All Hadoop implementations I've seen multithread heavily. Basically, most tasks that move work from map tasks into reducers are parallelized, as are the map tasks and reduce tasks themselves.

Checking "Hadoop - The Definitive Guide", the author mentions a number of processes that are multithreaded. These include

  1. Reducers have a small pool of "copier" threads to fetch map outputs in parallel.
  2. Mappers themselves can be multithreaded (`MultithreadedMapper`; see the sketch after this list).
  3. DataNodes have threads to copy data on and off HDFS.
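
For item 2, here is a minimal sketch (using the newer `mapreduce` API) of how a job could opt into `MultithreadedMapper`: the task's mapper is set to `MultithreadedMapper`, which then runs your real mapper class on a pool of threads. The `LineMapper` class, job name, and thread count below are illustrative placeholders, not anything from the book.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobSetup {

    // Trivial mapper used only for illustration: emits each line with a count of 1.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new LongWritable(1));
        }
    }

    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "multithreaded-map-example");

        // The map task runs MultithreadedMapper, which feeds input records
        // to a pool of LineMapper instances inside the same task JVM.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, LineMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // threads per map task (example value)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        return job;
    }
}
```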

Depending on how your cluster is configured, you can have DataNodes and TaskTrackers on the same machine, and this can start to add up to a lot of threads.

I'd guess that heavy use of concurrency has significant performance benefits, and that's why the implementors have gone that route.


As mentioned by Chrylis, the JVM has some GC threads and possibly other housekeeping threads running.

When it comes to user applications, multiple threads can be very useful.

An example of this is a case where you open a file, read each line, and then do some processing. While a thread is reading from the file, the CPU is mostly idle, because the thread spends most of its time waiting for the slow hard disk to return data. By using multiple threads, the CPU is utilized better: some threads can be doing useful work while other threads are waiting for I/O operations to complete. A small sketch of this pattern follows.
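
As a rough illustration of overlapping I/O with CPU work (this is plain Java, not Hadoop code; file names, pool size, and the "count non-empty lines" processing are just placeholders): each task in the pool blocks on disk reads while other threads keep the CPU busy processing lines they have already read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OverlappedIoExample {
    public static void main(String[] args) throws Exception {
        // One task per input file; while one thread waits on the disk,
        // another can use the CPU to process lines it already has.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String name : args) {
            pool.submit(() -> {
                try {
                    List<String> lines = Files.readAllLines(Path.of(name)); // blocks on I/O
                    long nonEmpty = lines.stream().filter(l -> !l.isBlank()).count(); // CPU work
                    System.out.println(name + ": " + nonEmpty + " non-empty lines");
                } catch (IOException e) {
                    System.err.println("failed to read " + name + ": " + e.getMessage());
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```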

I haven't used Hadoop, but I assume that when splitting the work, a node may in fact be running multiple tasks at once for this reason. A node probably also runs some threads for coordinating with other parts of the cluster.