
Search for a String in 1000 files and each file size is 1GB


You could write a simple MapReduce job to achieve this if you want. You don't actually need any reducers though, so the number of reducers would be set to zero. This way you can make use of the parallel processing power of MapReduce and chunk through the files much faster than a serial grep.

Just set up a Mapper that can be configured to search for the string you want. You would probably read in the files using TextInputFormat, split each line and check for the value you are searching for. The Mapper can then write out the name of the input file it is currently processing whenever there is a match.

Update:

To get going on this you could start with the standard word count example: http://wiki.apache.org/hadoop/WordCount. You can remove the Reducer and just modify the Mapper. It reads the input a line at a time, with the line passed in as a Text object in the value. I don't know what format your data is in, but you could even just convert the Text to a String and hardcode a .contains("") against that value to find the string you're searching for (for simplicity, not speed or best practice). You then just need to work out which file the Mapper was processing when you got a hit, and write out that file's name.
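For illustration, here is a minimal sketch of such a map-only Mapper (the class name, the search.term configuration key and the output types are placeholders I've picked, not anything from the word count example):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class StringSearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      private String searchString;

      @Override
      protected void setup(Context context) {
        // the string to look for is passed in through the job configuration
        searchString = context.getConfiguration().get("search.term");
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.toString().contains(searchString)) {
          // write out the name of the file this map task is currently reading
          String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
          context.write(new Text(fileName), NullWritable.get());
        }
      }
    }

In the driver you would set the search term with job.getConfiguration().set("search.term", ...) and call job.setNumReduceTasks(0), so the matching file names are written straight out without a reduce phase (a file name may then appear once per matching split, which is easy to dedupe afterwards).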


You can get a hint from the grep class. It comes with the distribution, in the examples folder.

./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output regex

For details on the implementation of this class, you can look at the source in the directory "src\examples\org\apache\hadoop\examples" that comes with the distribution.

So you can do this in your main class:

    Job searchjob = new Job(conf);
    searchjob.setJobName("job name");
    FileInputFormat.setInputPaths(searchjob, "input directory in hdfs");
    searchjob.setMapperClass(SearchMapper.class);
    searchjob.setCombinerClass(LongSumReducer.class);
    searchjob.setReducerClass(LongSumReducer.class);

In your SearchMapper class you can do this:

    public void map(K key, Text value, Context context)
        throws IOException, InterruptedException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      if (matcher.find()) {
        // emit the matched text with a count of 1; LongSumReducer sums these counts per key
        context.write(new Text(matcher.group()), new LongWritable(1));
      }
    }
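To make this actually run, the pattern field still needs to be initialized and the job submitted. One way to wire that up, following the approach of the bundled RegexMapper (the search.regex configuration key is just a name I've made up):

    // in SearchMapper: compile the regex from the job configuration once per task
    // (needs java.util.regex.Pattern / Matcher and
    //  org.apache.hadoop.mapreduce.lib.output.FileOutputFormat imports)
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      pattern = Pattern.compile(context.getConfiguration().get("search.regex"));
    }

    // back in the main class: pass the regex in, set the output, and submit
    searchjob.getConfiguration().set("search.regex", "the string to search for");
    searchjob.setOutputKeyClass(Text.class);
    searchjob.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(searchjob, new Path("output directory in hdfs"));
    searchjob.waitForCompletion(true);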


If you have 1000 files, is there any reason to use such a fine-grained parallelized technique? Why not just use xargs, or GNU parallel, and split the work over the files, instead of splitting the work within a file?

Also, it looks like you are grepping a literal string (not a regex); you can use the -F grep flag to search for string literals, which may speed things up, depending on how grep is implemented/optimized.
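For example, something along these lines (assuming the files sit on an ordinary filesystem under a hypothetical /data directory; adjust the path and the process count to your machine):

    # print the names of files containing the literal string, 8 greps at a time
    find /data -type f -print0 | xargs -0 -P 8 -n 10 grep -lF 'the string'

    # or the equivalent with GNU parallel
    find /data -type f | parallel -j 8 grep -lF 'the string' {}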

I haven't worked with MapReduce specifically, so this post may or may not be on point.