Hadoop: Having threads inside map function


Hadoop is built for parallelism, but only at a coarse granularity. It works well when the dataset is large and can be split into many subsets that are processed separately and independently (for simplicity, I am referring only to the map stage here), for example when searching a text for a single pattern.
Now, let's consider the following case: we have a lot of data, and we want to search this text for thousands of different patterns. We have two ways to utilize our multi-core CPUs:
1. Process each file with a separate single-threaded mapper, and run several mappers per node.
2. Run one mapper per node, and process each file using all cores.
The second way might be much more cache friendly, since all cores work on the same file at the same time, and therefore more efficient.
The bottom line: when the nature of the processing justifies fine-grained, multi-core-friendly parallelism, multithreading within the mapper can benefit us.
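
As a rough illustration of the second option, here is a minimal Java sketch of the per-record work such a mapper might do: the pattern list, not the data, is split across a thread pool, so every thread scans the same record. It is plain Java with no Hadoop dependency. The class ParallelPatternSearch, the method countMatches and the sample inputs are hypothetical names made up for this example, and creating the pool on every call is a simplification (a real mapper would more likely create it once per task).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: counts matches of many patterns
// in one input record, using several threads over the same record.
class ParallelPatternSearch {

    static Map<String, Integer> countMatches(String record, List<String> patterns, int threads)
            throws Exception {
        // Simplification: a pool per call. A mapper would normally reuse one pool.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String p : patterns) {
                // One task per pattern; all tasks read the same shared record.
                pending.add(pool.submit(() -> {
                    Matcher m = Pattern.compile(p).matcher(record);
                    int n = 0;
                    while (m.find()) {
                        n++;
                    }
                    counts.put(p, n);
                }));
            }
            // Wait for every task and surface any exception it threw.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        List<String> patterns = Arrays.asList("ab", "c", "z");
        Map<String, Integer> counts = countMatches("abcabc ab", patterns, 2);
        for (String p : patterns) {
            System.out.println(p + " -> " + counts.get(p));
        }
    }
}
```

Inside a real job, the body of countMatches would sit in the map function and the counts would be written to the job output instead of returned. Hadoop also ships a MultithreadedMapper class (in org.apache.hadoop.mapreduce.lib.map) aimed at running a mapper on several threads; the sketch above does not use it.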


If I understand Hadoop and map/reduce properly, you shouldn't need threads.

What makes you think parsing a single line of input is a bottleneck in your project? Does it merely seem to you that it's an issue or do you have data to prove it?

UPDATE: Thank you for the citation. It's obviously something that will have to be digested by me and others, so I won't have any snappy advice in the short term. But I appreciate the citation and your patience very much.