Distributed file processing in Hadoop?

It is possible to instruct MapReduce to use an input format where the input to each Mapper is a single file. (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3)

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Treats every input file as a single record: one whole file per mapper call.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // never split a file across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
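
The WholeFileRecordReader referenced above isn't shown in that snippet (the full source is at the linked URL). As a rough sketch of what it has to do, assuming the new org.apache.hadoop.mapreduce API: read the entire file backing the split into one BytesWritable and emit it as a single record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: reads the whole file into a single BytesWritable and returns
// exactly one (NullWritable, BytesWritable) record.
public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // the single record has already been served
        }
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { /* nothing held open between calls */ }
}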

Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file (see https://commons.apache.org/proper/commons-compress/examples.html), as sketched below.
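
Here is a minimal sketch of such a mapper, assuming the WholeFileInputFormat above feeds it one tar file per call; the class name TarEntryMapper and the choice to emit (entry name, entry size) pairs are just placeholders for whatever processing you actually need.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: receives one whole tar file per call (via WholeFileInputFormat)
// and emits (entry name, entry size) for every regular file inside the archive.
public class TarEntryMapper extends Mapper<NullWritable, BytesWritable, Text, Text> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new ByteArrayInputStream(value.copyBytes()))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) {
                    continue; // skip directories and special entries
                }
                // Reads the current entry's bytes; replace with your real processing.
                byte[] contents = IOUtils.toByteArray(tar);
                context.write(new Text(entry.getName()),
                              new Text(String.valueOf(contents.length)));
            }
        }
    }
}

Note that the whole archive is held in memory as a BytesWritable, which ties into the memory caveat at the end of this answer.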

You don't need to pass a list of files to Hadoop; just put all the files in a single HDFS directory and use that directory as your input path, as in the job driver sketched below.
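
For example, a driver might look like this; the paths /data/tars and /data/out and the class name TarJobDriver are placeholders for your own setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: points a map-only job at a directory of tar files in HDFS.
public class TarJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "process tar files");
        job.setJarByClass(TarJobDriver.class);

        job.setInputFormatClass(WholeFileInputFormat.class); // one mapper call per file
        job.setMapperClass(TarEntryMapper.class);
        job.setNumReduceTasks(0); // map-only, as a sketch

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/data/tars"));   // HDFS directory of tar files
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}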


DistCp copies files from one place to another; you can take a look at its docs, but I don't think it offers any decompression or unpacking capability. If a file is bigger than main memory, you will probably get out-of-memory errors, since this approach loads each whole file into a single BytesWritable. 8 GB is not very big for a Hadoop cluster; how many machines do you have?