Distributed file processing in Hadoop?
It is possible to instruct MapReduce to use an input format in which the input to each mapper is a single file, for example (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3):
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split a file: each mapper receives exactly one whole file.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
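The linked sample also contains the WholeFileRecordReader used above. If you prefer not to pull it in, a minimal sketch of such a reader (assuming each input file fits comfortably in memory, since it loads the entire file into one BytesWritable) looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the whole file into a single value.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to close; the stream is closed in nextKeyValue()
    }
}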
Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
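A rough sketch of such a mapper (the output key/value types and the TarFileMapper name are just illustrative assumptions, not from the linked sample): each call receives one whole tar file as a BytesWritable and iterates over its entries with TarArchiveInputStream.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TarFileMapper extends Mapper<NullWritable, BytesWritable, Text, LongWritable> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The whole tar file arrives as one BytesWritable; only the first
        // getLength() bytes of the backing array are valid.
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()))) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Here you would read the entry's bytes from tarIn and process them.
                // As a placeholder, just emit the entry name and its size.
                context.write(new Text(entry.getName()), new LongWritable(entry.getSize()));
            }
        }
    }
}

If the archives are gzipped (.tar.gz), wrap the ByteArrayInputStream in a GzipCompressorInputStream before handing it to TarArchiveInputStream.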
You don't need to pass a list of files to Hadoop; just put all the files in a single HDFS directory and use that directory as your input path.
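In the driver that would look something like this (hypothetical class and argument names, just to show how the input format and the input directory are wired together):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TarProcessingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process tar files");
        job.setJarByClass(TarProcessingDriver.class);

        // One mapper call per file: the whole-file input format from above.
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(TarFileMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Point the job at the HDFS directory that holds all the tar files.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}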
DistCp copies files from one place to another; you can take a look at its docs, but I don't think it offers any decompress or unpack capability. If a file is bigger than main memory, you will probably get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?