Cannot read bz2 compressed file in hadoop job


Check your core-site.xml configuration file and add the BZip2 codec class to the io.compression.codecs property if it is absent. Here is an example:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
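If you would rather check the file programmatically than by eye, the following is a minimal sketch that uses only the JDK's XML parser, not Hadoop's own Configuration class. The class name CodecConfigCheck and its methods are hypothetical names chosen for this illustration, and the inline XML in main is a stand-in for your real core-site.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: not part of Hadoop, only an illustration.
class CodecConfigCheck {
    static final String BZIP2 = "org.apache.hadoop.io.compress.BZip2Codec";

    /** Returns the class names listed under io.compression.codecs,
     *  or an empty list if the property is not set in this file. */
    static List<String> listedCodecs(InputStream coreSiteXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(coreSiteXml);
        NodeList props = doc.getElementsByTagName("property");
        List<String> codecs = new ArrayList<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            if ("io.compression.codecs".equals(childText(prop, "name"))) {
                // The value is a comma-separated list of codec class names.
                for (String c : childText(prop, "value").split(",")) {
                    if (!c.trim().isEmpty()) {
                        codecs.add(c.trim());
                    }
                }
            }
        }
        return codecs;
    }

    /** Trimmed text of the first child element with the given tag, or "". */
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the contents of a real core-site.xml.
        String xml = "<configuration><property>"
                + "<name>io.compression.codecs</name>"
                + "<value>org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec</value>"
                + "</property></configuration>";
        List<String> codecs = listedCodecs(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(codecs.contains(BZIP2)
                ? "BZip2Codec is listed"
                : "BZip2Codec is missing");
    }
}
```

To run it against a real file, pass a FileInputStream for your core-site.xml to listedCodecs instead of the inline string. An empty result only means the property is not set in that file; which codecs Hadoop then falls back to depends on the version.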

Edit:

After adding the codec, run the following steps to confirm that bz2 input works with a stock job (your own code may still fail):

hadoop fs -mkdir /tmp/wordcount/
echo "three one three three seven" >> /tmp/words
bzip2 -z /tmp/words
hadoop fs -put /tmp/words.bz2 /tmp/wordcount/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wordcount/ /tmp/wordcount_out/
hadoop fs -text /tmp/wordcount_out/part*
#you should see the next three lines:
#one     1
#seven   1
#three   3

#clean up
#these commands may be different in your case
hadoop fs -rmr /tmp/wordcount_out/
hadoop fs -rmr /tmp/wordcount/


In your TextInputFormat implementation you're probably overriding createRecordReader and returning a custom implementation of RecordReader<KEYIN, VALUEIN> that doesn't take the codec into account. The default implementation returns a LineRecordReader that handles codecs correctly. You can find a reference implementation here, and the relevant changes required here.
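To illustrate what "taking the codec into account" means, here is a minimal JDK-only sketch. It is not Hadoop code: the class CodecAwareLines and its methods are hypothetical, and gzip stands in for bzip2 because the JDK ships no bzip2 stream. In a real RecordReader the same decision is normally made by asking CompressionCodecFactory.getCodec(path) for a codec and, if one is returned, wrapping the file stream with codec.createInputStream(...).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration: not a Hadoop RecordReader.
class CodecAwareLines {

    /** Opens the file and wraps it in a decompressor when the suffix calls for one. */
    static InputStream open(Path file) throws IOException {
        InputStream raw = Files.newInputStream(file);
        if (file.getFileName().toString().endsWith(".gz")) {
            try {
                return new GZIPInputStream(raw);
            } catch (IOException e) {
                raw.close(); // do not leak the handle if the gzip header is invalid
                throw e;
            }
        }
        // No known compression suffix: read the bytes as they are.
        return raw;
    }

    /** Reads all lines, decompressing transparently if needed. */
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a small compressed file, then read it back through open().
        Path tmp = Files.createTempFile("words", ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("three one three three seven\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(readLines(tmp));
        Files.delete(tmp);
    }
}
```

A reader that skips the wrapping step in open() hands compressed bytes straight to the line splitter, which is the failure mode described above. Returning the default LineRecordReader from createRecordReader, or delegating to it from a custom reader, avoids reimplementing this logic.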