
Hadoop MapReduce Streaming sorting on multiple columns


You can sort on multiple columns, including numerically, by passing multiple -k options in mapred.text.key.comparator.options (the syntax mirrors the Linux sort command).

e.g. in bash

sort -k1,1 -k2rn

so for your example it would be

hadoop jar hadoop-streaming-1.2.1.jar \
    -Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -mapper cat \
    -reducer cat \
    -file mr_base.py \
    -file common.py \
    -file mr_sort_combiner.py \
    -input mr_combiner/2013_12_09__05_47_21/part-* \
    -output mr_sort_combiner/2013_12_09__07_15_59/
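
If you want to sanity-check the key spec before submitting a job, you can try the same options with plain sort on a few lines. This is only a rough local sketch: the sample rows are made up, and it assumes tab-separated input with a text key in column 1 and a number in column 2.

```shell
# Hypothetical sample rows (tab-separated): text key, then a numeric value.
# -k1,1 sorts on column 1 only; -k2rn sorts from column 2 onward,
# numerically and in reverse (largest first) within each key.
sorted=$(printf 'b\t2\na\t10\na\t9\nb\t30\n' \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2rn)

printf '%s\n' "$sorted"
# Output (tab-separated):
# a	10
# a	9
# b	30
# b	2
```

Note that in a streaming job the comparator only sees the key part of each line, which by default is everything before the first tab. If the second column is not part of the key in your data, you may also need something like -D stream.num.map.output.key.fields=2; that depends on your data layout and is not something the command above sets.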