Split a file into a number of small files in HDFS


A simple Hadoop Streaming job with the input format as NLineInputFormat can get this done.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-<version>.jar \
  -Dmapreduce.input.lineinputformat.linespermap=10 \
  -Dmapreduce.job.reduces=0 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -input /test.txt \
  -output /splitted_output

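Here the property mapreduce.input.lineinputformat.linespermap determines the number of lines each split contains. Because mapreduce.job.reduces is set to 0, the job is map-only: each map task processes one split and writes its own part file, so every file under /splitted_output holds at most 10 records from /test.txt.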
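
To sanity-check the result, it can help to work out how many part files the job should produce. The following is a minimal sketch, assuming one output file per split; expected_splits is a hypothetical helper written for this illustration, not part of Hadoop.

```shell
#!/usr/bin/env bash
# Sketch: estimate the number of part files a map-only NLineInputFormat job
# should write, assuming one map task (and one output file) per split.
# expected_splits is a hypothetical helper, not a Hadoop command.
expected_splits() {
  local total_lines=$1
  local lines_per_map=$2
  # Ceiling division: a final, shorter group of lines still gets its own split.
  echo $(( (total_lines + lines_per_map - 1) / lines_per_map ))
}

# Example: a 95-line input with linespermap=10
expected_splits 95 10   # prints 10
```

On a cluster, the line count can be taken from hdfs dfs -cat /test.txt | wc -l, and the estimate compared with the number of part-* files shown by hdfs dfs -ls /splitted_output (the listing also includes a _SUCCESS marker file).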