Merging multiple files into one within Hadoop
To keep everything on the grid, use Hadoop Streaming with a single reducer and `cat` as both the mapper and the reducer (effectively a no-op); compression can be enabled with MapReduce flags.
hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat
If you want compression, add:

    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
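Putting the pieces above together, the full invocation with gzip compression might look like the sketch below. Note that the `mapred.*` names are the old-style property names; on Hadoop 2+ they are deprecated in favor of `mapreduce.job.reduces`, `mapreduce.output.fileoutputformat.compress`, and `mapreduce.output.fileoutputformat.compress.codec` (the old names are still accepted and mapped automatically).

```shell
# Merge all input files into a single gzipped output file on the grid.
# cat/cat makes the map and reduce phases pass-throughs; one reducer
# forces a single output part file.
hadoop jar "$HADOOP_PREFIX"/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name="$QUEUE" \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat
```

The merged result lands in the output directory as a single part file (something like `$OUTPUT/part-00000.gz`), not at `$OUTPUT` itself.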
OK, I figured out a way using `hadoop fs` commands:
hadoop fs -cat [dir]/* | hadoop fs -put - [destination file]
It worked when I tested it. Are there any pitfalls to watch out for?
Thanks!