Hadoop: Having threads inside map function

java multithreading hadoop mapreduce

Hadoop is built for parallelism, but it does it in a very coarse-grained manner. Hadoop parallelism works well when the dataset is big and can be divided into many subsets that are processed separately and independently (I am referring to the Map stage only, for simplicity), for example, searching for one pattern in a text.
Now, let's consider the following case: we have a lot of data, and we want to search for thousands of different patterns in this text. We have two choices for utilizing our multi-core CPUs:
1. Process each file with a separate single-threaded mapper, and run several mappers per node.
2. Define one mapper per node and process one file using all cores.
The second way might be much more cache friendly, and therefore more efficient.
The bottom line: in cases where fine-grained, multi-core-friendly parallelism is justified by the nature of the processing, using multi-threading within a mapper can benefit us.
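
To make the second option concrete, here is a minimal sketch of the per-record work fanned out over a thread pool. It is plain Java with no Hadoop dependency, so it only illustrates what the body of a map call might do. The names PatternSearchSketch, countOne and countAll are hypothetical, and the thread count is an assumption you would tune to the cores available to the task. Hadoop also ships a MultithreadedMapper class that may be worth checking before rolling your own pool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Hadoop. In a real job this logic
// would sit inside (or be called from) the mapper's map method.
final class PatternSearchSketch {

    // Counts non-overlapping matches of one compiled pattern in the text.
    static int countOne(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Searches the same text for every pattern, one task per pattern,
    // so all threads work on data that is already loaded in memory.
    static Map<String, Integer> countAll(String text, List<String> regexes, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String regex : regexes) {
                Pattern p = Pattern.compile(regex);
                Callable<Integer> task = () -> countOne(text, p);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order; get() blocks until done.
            Map<String, Integer> result = new LinkedHashMap<>();
            for (int i = 0; i < regexes.size(); i++) {
                result.put(regexes.get(i), futures.get(i).get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pattern search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("pattern search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as PatternSearchSketch.countAll(line, patterns, 4) would return a map from each pattern to its match count. Creating a pool per call is only for brevity here; in a mapper you would more likely create it once in setup and shut it down in cleanup. Whether this beats several single-threaded mappers per node is something to measure rather than assume.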

You shouldn't need threads if I understand Hadoop and map/reduce properly.

What makes you think parsing a single line of input is a bottleneck in your project? Does it merely seem to you that it's an issue or do you have data to prove it?

UPDATE: Thank you for the citation. It's obviously something that will have to be digested by me and others, so I won't have any snappy advice in the short term. But I appreciate the citation and your patience very much.
