Distributed file processing in Hadoop?
It is possible to instruct MapReduce to use an input format in which the input to each mapper is a single file, for example (from https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3):
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Never split a file: each mapper receives exactly one whole file.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(inputSplit, context);
        return reader;
    }
}
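The linked sample also contains the WholeFileRecordReader used above. If you prefer not to pull it in, a minimal sketch of such a reader (assuming each input file fits comfortably in memory, since it loads the entire file into one BytesWritable) looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the whole file into a single value.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
        // nothing to close; the stream is closed in nextKeyValue()
    }
}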
Then, in your mapper, you can use the Apache Commons Compress library to unpack the tar file: https://commons.apache.org/proper/commons-compress/examples.html
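A rough sketch of such a mapper (the output key/value types and the TarFileMapper name are just illustrative assumptions, not from the linked sample): each call receives one whole tar file as a BytesWritable and iterates over its entries with TarArchiveInputStream.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TarFileMapper extends Mapper<NullWritable, BytesWritable, Text, LongWritable> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The whole tar file arrives as one BytesWritable; only the first
        // getLength() bytes of the backing array are valid.
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()))) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Here you would read the entry's bytes from tarIn and process them.
                // As a placeholder, just emit the entry name and its size.
                context.write(new Text(entry.getName()), new LongWritable(entry.getSize()));
            }
        }
    }
}

If the archives are gzipped (.tar.gz), wrap the ByteArrayInputStream in a GzipCompressorInputStream before handing it to TarArchiveInputStream.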
You don't need to pass a list of files to Hadoop; just put all the files in a single HDFS directory and use that directory as your input path.
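In the driver that would look something like this (hypothetical class and argument names, just to show how the input format and the input directory are wired together):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TarProcessingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process tar files");
        job.setJarByClass(TarProcessingDriver.class);

        // One mapper call per file: the whole-file input format from above.
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(TarFileMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Point the job at the HDFS directory that holds all the tar files.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}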
DistCp copies files from one place to another; you can take a look at its docs, but I don't think it offers any decompress or unpack capability. If a file is bigger than main memory, you will probably get out-of-memory errors. 8 GB is not very big for a Hadoop cluster; how many machines do you have?