merge output files after reduce phase


Instead of merging the files yourself, you can delegate the merging of the reduce output files to Hadoop by calling:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

Note: This merges the HDFS files into a single file on the local file system. Make sure you have enough local disk space before running it.


No, these files are not merged by Hadoop. The number of files you get is the same as the number of reduce tasks.

If you need the output as input for a subsequent job, don't worry about having separate files. Simply specify the entire directory as the input for that job.

If you do need the data outside the cluster, I usually merge the files at the receiving end when pulling the data off the cluster.

For example:

hadoop fs -cat /some/where/on/hdfs/job-output/part-r-* > TheCombinedResultOfTheJob.txt
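If the part files have already been copied to a local directory (for example with hadoop fs -get), merging at the receiving end could also be done in plain Java. The following is only a minimal sketch under that assumption; the class name PartFileMerger and its merge method are hypothetical helpers, not part of Hadoop, and the part-r-* naming is taken from the command above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Hadoop class): concatenates reducer output files
// that were already downloaded into a local directory.
final class PartFileMerger {

    // Appends every part-r-* file in localDir to target, in file-name order,
    // and returns the number of part files that were merged.
    static int merge(Path localDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips marker files such as _SUCCESS and hidden .crc files.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(localDir, "part-r-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        // Zero-padded names (part-r-00000, part-r-00001, ...) sort in reducer order.
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the bytes instead of loading the whole file
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        int merged = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + merged + " part files into " + args[1]);
    }
}
```

The sketch takes the local directory and the target file as its two arguments. Like the hadoop fs -cat approach, it simply concatenates the files, so any ordering across reducers is whatever the job's partitioner produced.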


This is a function you can use to merge files in HDFS:

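import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Assumes the enclosing class provides a Hadoop Configuration field named
// config and a logger field named logger.
// Note: FileUtil.copyMerge exists in Hadoop 2.x; it was removed in Hadoop 3.0.
public boolean getMergeInHdfs(String src, String dest) throws IllegalArgumentException, IOException {
    FileSystem fs = FileSystem.get(config);
    Path srcPath = new Path(src);
    Path dstPath = new Path(dest);

    // Both the source directory and the destination path must already exist
    if (!fs.exists(srcPath)) {
        logger.info("Path " + src + " does not exist!");
        return false;
    }
    if (!fs.exists(dstPath)) {
        logger.info("Path " + dest + " does not exist!");
        return false;
    }

    // Concatenate the files under srcPath into dstPath; false keeps the source files
    return FileUtil.copyMerge(fs, srcPath, fs, dstPath, false, config, null);
}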