Why do Hadoop jobs need so many threads?


All Hadoop implementations I've seen multithread heavily. Basically, most tasks that move work from map tasks into reducers are parallelized, as are the map tasks and reduce tasks themselves.

Checking "Hadoop - The Definitive Guide", the author mentions a number of processes that are multithreaded. These include

  1. Reducers have a small pool of "copier" threads to fetch map outputs in parallel.
  2. Mappers themselves can be multithreaded (`MultithreadedMapper`; see the sketch after this list).
  3. DataNodes have threads to copy data on and off HDFS.
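
For item 2, here is a minimal sketch (using the newer `mapreduce` API) of how a job could opt into `MultithreadedMapper`: the task's mapper is set to `MultithreadedMapper`, which then runs your real mapper class on a pool of threads. The `LineMapper` class, job name, and thread count below are illustrative placeholders, not anything from the book.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobSetup {

    // Trivial mapper used only for illustration: emits each line with a count of 1.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new LongWritable(1));
        }
    }

    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "multithreaded-map-example");

        // The map task runs MultithreadedMapper, which feeds input records
        // to a pool of LineMapper instances inside the same task JVM.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, LineMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // threads per map task (example value)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        return job;
    }
}
```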

Depending on how your cluster is configured, you can have DataNodes and TaskTrackers on the same machine, and this can start to add up to a lot of threads.

I'd guess that heavy use of concurrency has significant performance benefits, and that's why the implementors have gone that route.


As mentioned by Chrylis, the JVM has some GC threads and possibly other housekeeping threads running.

When it comes to user applications, multiple threads can be very useful.

An example of this is a case where you open a file, read each line, and then do some processing. While a thread is reading from the file, the CPU is mostly idle, because the thread spends most of its time waiting for the slow hard disk to return data. By using multiple threads, the CPU is utilized better: some threads can be doing useful work while other threads are waiting for I/O operations to complete. A small sketch of this pattern follows.
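
As a rough illustration of overlapping I/O with CPU work (this is plain Java, not Hadoop code; file names, pool size, and the "count non-empty lines" processing are just placeholders): each task in the pool blocks on disk reads while other threads keep the CPU busy processing lines they have already read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OverlappedIoExample {
    public static void main(String[] args) throws Exception {
        // One task per input file; while one thread waits on the disk,
        // another can use the CPU to process lines it already has.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String name : args) {
            pool.submit(() -> {
                try {
                    List<String> lines = Files.readAllLines(Path.of(name)); // blocks on I/O
                    long nonEmpty = lines.stream().filter(l -> !l.isBlank()).count(); // CPU work
                    System.out.println(name + ": " + nonEmpty + " non-empty lines");
                } catch (IOException e) {
                    System.err.println("failed to read " + name + ": " + e.getMessage());
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```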

I haven't used Hadoop, but I assume that when splitting the work, a node may in fact be running multiple tasks at once for this reason. A node probably also runs some threads for coordinating with other parts of the cluster.