Cannot read bz2 compressed file in hadoop job
Check your core-site.xml configuration file and add the BZip2 codec class to io.compression.codecs if it is absent. Here is an example:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
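To see quickly whether the codec is already listed, something like the following may help. It is only a rough sketch: has_bzip2_codec is a hypothetical helper name (not part of Hadoop), and the /etc/hadoop/conf fallback path is an assumption that varies by distribution.

```shell
# Hypothetical helper (not part of Hadoop): succeeds if the given
# core-site.xml mentions the BZip2 codec class anywhere in the file.
has_bzip2_codec() {
    grep -q 'org\.apache\.hadoop\.io\.compress\.BZip2Codec' "$1" 2>/dev/null
}

# Assumed location; adjust to wherever your distribution keeps its config.
CONF="${HADOOP_CONF_DIR:-/etc/hadoop/conf}/core-site.xml"

if has_bzip2_codec "$CONF"; then
    echo "BZip2Codec is listed in $CONF"
else
    echo "BZip2Codec is missing from $CONF (or the file was not found)"
fi
```

This is a plain text match, so it will also report success if the class name only appears inside an XML comment; open the file to confirm when in doubt.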
Edit:
After adding the codec, run the following steps to confirm that it works (the bundled example should work even if your own code does not):
hadoop fs -mkdir /tmp/wordcount/
echo "three one three three seven" >> /tmp/words
bzip2 -z /tmp/words
hadoop fs -put /tmp/words.bz2 /tmp/wordcount/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wordcount/ /tmp/wordcount_out/
hadoop fs -text /tmp/wordcount_out/part*
# you should see the following three lines:
# one 1
# seven 1
# three 3
# clean up
# these commands may be different in your case
hadoop fs -rmr /tmp/wordcount_out/
hadoop fs -rmr /tmp/wordcount/
In your TextInputFormat implementation you're probably overriding createRecordReader and returning a custom implementation of RecordReader<KEYIN, VALUEIN> that doesn't take the codec into account. The default implementation returns a LineRecordReader that handles codecs correctly. You can find a reference implementation here, and the relevant changes required here.