
How to read bz2-compressed (bzip2) Wikipedia dumps into a streaming XML record reader for Hadoop MapReduce


The Wikimedia Foundation just released an InputReader for the Hadoop Streaming interface that can read the bz2-compressed full dump files and send them to your mappers. The unit sent to a mapper is not a whole page but a pair of revisions (so you can run a diff on the two). This is the initial release, and I am sure there will be some bugs, but please give it a spin and help us test it.

This InputReader requires Hadoop 0.21, because Hadoop 0.21 has streaming support for bz2 files. The source code is available at: https://github.com/whym/wikihadoop


Your problem is the same as described here, so my answer is the same too: create your own variation on TextInputFormat. In it, write a new RecordReader that skips lines until it sees the start of a logical record.
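
To illustrate the skipping idea, here is a minimal sketch in plain Java. It deliberately avoids the Hadoop API so it can run on its own. The class name LogicalRecordReader, its methods, and the choice of <page> and </page> as record delimiters are all assumptions made for illustration, not code from Hadoop or WikiHadoop. A real RecordReader would also have to track byte offsets and split boundaries.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or WikiHadoop.
class LogicalRecordReader {
    private final BufferedReader in;
    private final String startTag; // assumed marker opening a record, e.g. "<page>"
    private final String endTag;   // assumed marker closing a record, e.g. "</page>"

    LogicalRecordReader(BufferedReader in, String startTag, String endTag) {
        this.in = in;
        this.startTag = startTag;
        this.endTag = endTag;
    }

    // Returns the next complete record, or null when no complete record is left.
    String nextRecord() throws IOException {
        String line;
        // Skip lines until one begins a logical record. This discards the
        // tail of a record that was cut off before our starting position.
        while ((line = in.readLine()) != null && !line.trim().startsWith(startTag)) {
            // nothing to do: the line belongs to a record we did not see the start of
        }
        if (line == null) {
            return null; // reached the end of input without finding a record start
        }
        StringBuilder record = new StringBuilder();
        do {
            record.append(line).append('\n');
            if (line.trim().endsWith(endTag)) {
                return record.toString(); // record is complete
            }
        } while ((line = in.readLine()) != null);
        return null; // input ended in the middle of a record; this sketch drops it
    }

    // Convenience wrapper: collects every complete record found in a string.
    static List<String> readAll(String text, String startTag, String endTag) {
        LogicalRecordReader reader =
            new LogicalRecordReader(new BufferedReader(new StringReader(text)), startTag, endTag);
        List<String> records = new ArrayList<>();
        try {
            String record;
            while ((record = reader.nextRecord()) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

One design point the sketch glosses over: it drops a record that is cut off at the end of its input. In an actual input split the reader would normally keep reading past the end of the split to finish that record, since the reader for the next split skips the same lines as a partial record.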