
Splitting large XML files into manageable sections for Hadoop


I think the Cloud9 project at UMD might help you with this.

The library provides an XMLInputFormat class, which might be of use.

Also of interest is this page in the Cloud9 documentation, which looks at how you can process an XML dump of Wikipedia in MapReduce.
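To give a feel for how a tag-delimited XML input format can cut a big file into records, here is a rough Python sketch of the general idea. This is not Cloud9's code: the function name `records_in_split` and its parameters are made up for illustration, and I'm assuming each record sits between a fixed start and end tag (such as `<page>` and `</page>` in a Wikipedia dump) and that records don't nest. Each split seeks to its own byte offset, emits every record whose start tag begins inside the split, and reads past the split boundary to finish the last one, so no record is lost or read twice.

```python
import io


def records_in_split(f, split_start, split_end, start_tag, end_tag, chunk_size=64 * 1024):
    """Yield (offset, record_bytes) for records whose start tag begins in
    [split_start, split_end). `f` is a seekable binary file object.
    Hypothetical helper for illustration; not part of Cloud9 or Hadoop."""
    f.seek(split_start)
    buf = b""
    base = split_start  # file offset of buf[0]
    pos = 0             # position in buf where scanning resumes

    def fill():
        # Append the next chunk to the buffer; return False at end of file.
        nonlocal buf
        chunk = f.read(chunk_size)
        buf += chunk
        return bool(chunk)

    while True:
        # Step 1: find the next start tag, reading more data as needed.
        s = buf.find(start_tag, pos)
        while s == -1:
            # Keep only the tail that could still hold a partial start tag.
            keep_from = max(pos, len(buf) - len(start_tag) + 1)
            base += keep_from
            buf = buf[keep_from:]
            pos = 0
            if base >= split_end:
                return  # any later record belongs to the next split
            if not fill():
                return  # end of file
            s = buf.find(start_tag, pos)
        if base + s >= split_end:
            return  # this record starts in the next split

        # Step 2: find the matching end tag, even if it lies past split_end.
        search_from = s + len(start_tag)
        e = buf.find(end_tag, search_from)
        while e == -1:
            # Re-scan only the part that could hold a partial end tag.
            search_from = max(search_from, len(buf) - len(end_tag) + 1)
            if not fill():
                return  # truncated record at end of file; drop it
            e = buf.find(end_tag, search_from)

        end = e + len(end_tag)
        yield base + s, buf[s:end]
        pos = end


if __name__ == "__main__":
    data = b"<root><page>a</page><page>bb</page><page>ccc</page></root>"
    # Two arbitrary splits; the boundary at byte 25 falls inside the second record.
    for start, stop in [(0, 25), (25, len(data))]:
        for offset, record in records_in_split(
            io.BytesIO(data), start, stop, b"<page>", b"</page>", chunk_size=4
        ):
            print(start, stop, offset, record)
```

In a real job you would let the input format do this for you and just hand the mapper one complete element at a time; the sketch is only meant to show why arbitrary byte splits still work on XML as long as the records have recognisable start and end tags.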