How to read bz2-compressed (bzip2) Wikipedia dumps with a streaming XML record reader for Hadoop MapReduce
The Wikimedia Foundation just released an InputReader for the Hadoop Streaming interface that can read the bz2-compressed full dump files and send them to your mappers. The unit sent to a mapper is not a whole page but a pair of revisions, so you can run a diff on the two. This is the initial release, and I am sure there will be some bugs, but please give it a spin and help us test it.
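To make the "two revisions per mapper" idea concrete, here is a minimal Python sketch of the diff step a streaming mapper might perform. The record layout is an assumption on my part: I assume each record arrives as a chunk of XML text containing two <revision> elements, each with a non-empty <text> child. The function names are hypothetical and are not part of the InputReader.

```python
import difflib
import re

# Assumption: a record is XML text holding two <revision> elements, each
# with a <text>...</text> child. Self-closing <text/> elements (empty
# revisions) are not handled by this simple pattern.
TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)


def revision_texts(record):
    """Return the bodies of all <text> elements found in the record."""
    return TEXT_RE.findall(record)


def diff_record(record):
    """Return unified-diff lines between the first two revision texts.

    Returns an empty list when fewer than two revisions are present.
    """
    texts = revision_texts(record)
    if len(texts) < 2:
        return []
    old_lines = texts[0].splitlines()
    new_lines = texts[1].splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm="")
    )
```

A real mapper would read its records from standard input, call something like diff_record on each one, and print the result as key/value lines. How records are delimited on standard input depends on the InputReader, so check its documentation before relying on this layout.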
This InputReader requires Hadoop 0.21, because that release has streaming support for bz2 files. The source code is available at: https://github.com/whym/wikihadoop
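If you want to see what reading bz2 data as a stream looks like outside Hadoop, the following Python sketch decompresses a file chunk by chunk without loading it all into memory. It only illustrates incremental bz2 decompression using the standard library; it is not how Hadoop or the InputReader does it, and the function name is hypothetical.

```python
import bz2


def iter_decompressed(chunks):
    """Yield decompressed bytes from an iterable of compressed byte chunks.

    Handles files made of several concatenated bz2 streams by starting a
    new decompressor whenever the current stream ends.
    """
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        while chunk:
            data = decomp.decompress(chunk)
            if data:
                yield data
            if decomp.eof:
                # Bytes left after the end of one stream belong to the next.
                chunk = decomp.unused_data
                decomp = bz2.BZ2Decompressor()
            else:
                chunk = b""
```

For a dump on local disk, the chunks could come from repeated calls to read() on a file opened in binary mode, and the yielded bytes could then be fed to an incremental XML parser.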