
Efficient search in a corpus


With MapReduce you shouldn't try to do everything in a single step or job. It looks like you should split this problem up into multiple steps. Since you are generating the data that's stored on HDFS, and you need to know the source of each word, you should probably go for a format something like:

{SOURCE},{WORD},{FREQUENCY}
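For example, with made-up source names and counts, records in that format might look like:

sourceA,word123,18
sourceB,word123,7
sourceA,word456,2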

Remember that you are talking about a distributed file system, so referring to your inputs as file1 and file2 isn't technically correct. Both your reference data and source data will be spread throughout the cluster, with pieces of each located on every node.

Next, starting with your pseudo-code example, you will need to create a job which correlates a word to its source and frequency. Your mapper will work just fine, but the reducer will need to link the words to their sources. You will need to create your own Writable object which contains a Map<source, frequency>. This will be written to HDFS as intermediate data your follow-on filter jobs can work with.
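As a minimal sketch of such a Writable (assuming sources are plain strings and frequencies fit in a long; the class name is made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Writable;

// Hypothetical Writable carrying the source -> frequency map for one word.
public class SourceFrequencyWritable implements Writable {
    private final Map<String, Long> frequencies = new HashMap<>();

    public Map<String, Long> getFrequencies() {
        return frequencies;
    }

    public void add(String source, long frequency) {
        frequencies.put(source, frequency);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(frequencies.size());
        for (Map.Entry<String, Long> e : frequencies.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeLong(e.getValue());
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        frequencies.clear();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            frequencies.put(in.readUTF(), in.readLong());
        }
    }
}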

You can then use the output from this step as the input to 3 different MapReduce jobs, where each one looks for a different combination of sources. These jobs will be very simple, since the mapper will just pass the same data through, but the reducer will check each value for the different combinations of sources.
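For example, the filter reducer for the "appears in both sources" combination might look something like this; the source names sourceA/sourceB and the "source,frequency" value encoding are assumptions carried over from the sample records above:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical filter: keep only words that occur in BOTH named sources.
// Assumes the intermediate records arrive as word -> "source,frequency" pairs.
public class BothSourcesReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Long> bySource = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            bySource.put(parts[0], Long.parseLong(parts[1]));
        }
        // This job's combination: present in both sources; the other two jobs
        // would check "only in sourceA" and "only in sourceB" instead.
        if (bySource.containsKey("sourceA") && bySource.containsKey("sourceB")) {
            context.write(word, new Text(bySource.toString()));
        }
    }
}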

So if you take this approach you will need 4 MapReduce jobs. You don't need to run each one by hand; you can have a single driver which runs them sequentially. Alternatively, since the final 3 jobs use the same input data, you could start those three at the same time once the first has finished. Which option is better will probably depend on the amount of data and intermediate data your cluster is able to manage, and on the number of mappers/reducers each job will require.
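A sketch of the sequential variant, assuming the org.apache.hadoop.mapreduce API; the per-job set-up (mapper/reducer classes, input/output paths and formats) is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver: run the correlation job first, then the three
// filter jobs over its output. waitForCompletion(true) blocks until
// each job finishes and returns false on failure.
public class SearchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/word-source-frequency");

        Job correlate = Job.getInstance(conf, "correlate words to sources");
        // ... set mapper/reducer classes; output path = intermediate ...
        if (!correlate.waitForCompletion(true)) {
            System.exit(1);
        }

        for (String combination : new String[] {"only-a", "only-b", "both"}) {
            Job filter = Job.getInstance(conf, "filter " + combination);
            // ... set the matching filter reducer; input path = intermediate ...
            if (!filter.waitForCompletion(true)) {
                System.exit(1);
            }
        }
    }
}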

Hope this suggestion helps.


This looks like a job the Aho-Corasick string search algorithm was designed for. I have never coded it myself, but a little googling should turn up some code.
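For a feel of how it works, here is a compact sketch: a trie over the patterns, failure links built breadth-first, then a single left-to-right pass over the text. This is an illustration, not production code:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Minimal Aho-Corasick sketch: build a trie with failure links, then scan
// the text once, printing (position, pattern) for every match found.
public class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> matches = new ArrayList<>();
    }

    private final Node root = new Node();

    public AhoCorasick(List<String> patterns) {
        for (String pattern : patterns) {            // build the trie
            Node node = root;
            for (char c : pattern.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.matches.add(pattern);
        }
        Queue<Node> queue = new ArrayDeque<>();      // BFS for failure links
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node fail = node.fail;
                while (fail != root && !fail.next.containsKey(e.getKey())) {
                    fail = fail.fail;
                }
                Node child = e.getValue();
                child.fail = fail.next.getOrDefault(e.getKey(), root);
                // Inherit matches reachable via the failure link.
                child.matches.addAll(child.fail.matches);
                queue.add(child);
            }
        }
    }

    public void search(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String match : node.matches) {
                System.out.println((i - match.length() + 1) + ": " + match);
            }
        }
    }
}

The scan touches each character of the text once, so the running time is linear in the text length plus the number of reported matches.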

Rabin-Karp might also work, but I have no idea how it behaves with multiple patterns when they are not all of the same length. Note: the multi-pattern pseudocode in the Wikipedia article appears to be wrong, but it should give you a starting point.
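One way around the mixed-length problem is to group the patterns by length and keep a separate rolling hash per group. Here is a rough sketch of that idea; the BASE and MOD constants and the final substring check (needed because hashes can collide) are choices of this sketch, not anything from the article:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of multi-pattern Rabin-Karp: patterns are grouped by length and
// each group gets its own rolling hash over the text.
public class MultiRabinKarp {
    private static final long BASE = 256;
    private static final long MOD = 1_000_000_007L;

    public static void search(String text, List<String> patterns) {
        Map<Integer, Set<String>> byLength = new HashMap<>();
        for (String p : patterns) {
            byLength.computeIfAbsent(p.length(), k -> new HashSet<>()).add(p);
        }
        for (Map.Entry<Integer, Set<String>> group : byLength.entrySet()) {
            int m = group.getKey();
            if (m == 0 || m > text.length()) continue;
            Set<Long> patternHashes = new HashSet<>();
            for (String p : group.getValue()) {
                patternHashes.add(hash(p, m));
            }
            long pow = 1;                            // BASE^(m-1) mod MOD
            for (int i = 1; i < m; i++) pow = (pow * BASE) % MOD;
            long h = hash(text, m);                  // hash of first window
            for (int i = 0; ; i++) {
                if (patternHashes.contains(h)) {
                    String window = text.substring(i, i + m);
                    if (group.getValue().contains(window)) { // hashes can collide
                        System.out.println(i + ": " + window);
                    }
                }
                if (i + m >= text.length()) break;
                h = (h - text.charAt(i) * pow % MOD + MOD) % MOD; // drop left char
                h = (h * BASE + text.charAt(i + m)) % MOD;        // add right char
            }
        }
    }

    private static long hash(CharSequence s, int m) {
        long h = 0;
        for (int i = 0; i < m; i++) h = (h * BASE + s.charAt(i)) % MOD;
        return h;
    }
}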


In the spirit of quick and dirty:

fgrep --mmap -f query-file corpus-file