
Search for a String in 1000 files and each file size is 1GB


You could write a simple MapReduce job to achieve this if you want. You don't actually need any reducers though, so the number of reducers would be set to zero. This way you can make use of the parallel processing power of MapReduce and chunk through the files much faster than a serial grep.

Just set up a Mapper that can be configured to search for the string you want. You would probably read in the files using TextInputFormat, split each line and check for the value you are searching for. The Mapper can then write out the name of the input file it is currently processing whenever there is a match.

Update:

To get going on this you could start with the standard word count example: http://wiki.apache.org/hadoop/WordCount. You can remove the Reducer and just modify the Mapper. It reads the input a line at a time, with the line passed in as a Text object in the value. I don't know what format your data is in, but you could even just convert the Text to a String and hardcode a .contains("") against that value to find the string you're searching for (for simplicity, not speed or best practice). You then just need to work out which file the Mapper was processing when you got a hit, and write out that file's name.
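For illustration, here is a minimal sketch of such a map-only Mapper (the class name, the search.term configuration key and the output types are placeholders I've picked, not anything from the word count example):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class StringSearchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      private String searchString;

      @Override
      protected void setup(Context context) {
        // the string to look for is passed in through the job configuration
        searchString = context.getConfiguration().get("search.term");
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.toString().contains(searchString)) {
          // write out the name of the file this map task is currently reading
          String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
          context.write(new Text(fileName), NullWritable.get());
        }
      }
    }

In the driver you would set the search term with job.getConfiguration().set("search.term", ...) and call job.setNumReduceTasks(0), so the matching file names are written straight out without a reduce phase (a file name may then appear once per matching split, which is easy to dedupe afterwards).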


You can get a hint from the grep class. It comes with the distribution, in the examples folder.

./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output regex

For details on the implementation of this class, you can look at the source in the directory "src\examples\org\apache\hadoop\examples" that comes with the distribution.

So you can do this in your main class:

    Job searchjob = new Job(conf);
    searchjob.setJobName("job name");
    FileInputFormat.setInputPaths(searchjob, "input directory in hdfs");
    searchjob.setMapperClass(SearchMapper.class);
    searchjob.setCombinerClass(LongSumReducer.class);
    searchjob.setReducerClass(LongSumReducer.class);

In your SearchMapper class you can do this:

    public void map(K key, Text value, Context context)
        throws IOException, InterruptedException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      if (matcher.find()) {
        // emit the matched text with a count of 1; LongSumReducer sums these counts per key
        context.write(new Text(matcher.group()), new LongWritable(1));
      }
    }
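To make this actually run, the pattern field still needs to be initialized and the job submitted. One way to wire that up, following the approach of the bundled RegexMapper (the search.regex configuration key is just a name I've made up):

    // in SearchMapper: compile the regex from the job configuration once per task
    // (needs java.util.regex.Pattern / Matcher and
    //  org.apache.hadoop.mapreduce.lib.output.FileOutputFormat imports)
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
      pattern = Pattern.compile(context.getConfiguration().get("search.regex"));
    }

    // back in the main class: pass the regex in, set the output, and submit
    searchjob.getConfiguration().set("search.regex", "the string to search for");
    searchjob.setOutputKeyClass(Text.class);
    searchjob.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(searchjob, new Path("output directory in hdfs"));
    searchjob.waitForCompletion(true);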


If you have 1000 files, is there any reason to use such a fine-grained parallelized technique? Why not just use xargs, or GNU parallel, and split the work over the files, instead of splitting the work within a file?

Also, it looks like you are grepping a literal string (not a regex); you can use the -F grep flag to search for string literals, which may speed things up, depending on how grep is implemented/optimized.
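For example, something along these lines (assuming the files sit on an ordinary filesystem under a hypothetical /data directory; adjust the path and the process count to your machine):

    # print the names of files containing the literal string, 8 greps at a time
    find /data -type f -print0 | xargs -0 -P 8 -n 10 grep -lF 'the string'

    # or the equivalent with GNU parallel
    find /data -type f | parallel -j 8 grep -lF 'the string' {}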

I haven't worked with MapReduce specifically, so this post may or may not be on point.