
Merging multiple files into one within Hadoop


To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (essentially a no-op), and enable compression with MapReduce flags:

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compressed output, also add:

-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
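
Putting it together, a full invocation might look like the sketch below; the queue name and paths are placeholders for your cluster, and the mapred.* property names are the older spellings, which newer Hadoop versions still accept as deprecated aliases of the mapreduce.* names:

# Illustrative values - substitute your own queue and HDFS paths
QUEUE=default
INPUT=/user/me/input_dir
OUTPUT=/user/me/merged_output

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat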


Alternatively, hadoop fs -getmerge concatenates all the files in an HDFS directory into a single file on the local filesystem:

hadoop fs -getmerge <dir_of_input_files> <merged_single_file>
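
Note that -getmerge writes the merged result to the local machine, not to HDFS, so a round trip is needed if you want the file back on the grid. A minimal sketch, with illustrative paths:

# getmerge writes to the local filesystem; copy the result back to HDFS
hadoop fs -getmerge /user/me/input_dir /tmp/merged.txt
hadoop fs -put /tmp/merged.txt /user/me/merged.txt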


Okay... I figured out a way using hadoop fs commands:

hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
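
For example, with illustrative paths:

# Concatenate every file in the directory and stream it straight back into HDFS
hadoop fs -cat /user/me/input_dir/* | hadoop fs -put - /user/me/merged.txt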

It worked when I tested it... any pitfalls anyone can think of?

Thanks!