How to Parse Big (50 GB) XML Files in Java
Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.
You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.
What I've sometimes done for this sort of situation is similar to the following.
Create an interface for processing a single element:
    public interface PageProcessor {
        void process(Page page);
    }
Supply an implementation of this to the PageHandler through a constructor:
    // Read.java
    public class Read {
        public static void main(String[] args) {
            XMLManager.load(new PageProcessor() {
                @Override
                public void process(Page page) {
                    // Obviously you want to do something other than just printing,
                    // but I don't know what that is...
                    System.out.println(page);
                }
            });
        }
    }

    // XMLManager.java
    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;

    public class XMLManager {
        public static void load(PageProcessor processor) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                SAXParser parser = factory.newSAXParser();
                File file = new File("pages-articles.xml");
                // The handler pushes each finished page to the processor
                PageHandler pageHandler = new PageHandler(processor);
                parser.parse(file, pageHandler);
            } catch (ParserConfigurationException | SAXException | IOException e) {
                e.printStackTrace();
            }
        }
    }
Send data to this processor instead of putting it in the list:
    public class PageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private Page page;
        private StringBuilder stringBuilder;
        private boolean idSet = false;

        public PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            //Unchanged from your implementation
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            //Unchanged from your implementation
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            // Elide code not needing change
                } else if (qName.equals("page")) {
                    processor.process(page);
                    page = null;
                }
            } else {
                page = null;
            }
        }
    }
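To see the whole callback pipeline run end to end, here is a minimal self-contained sketch. It is not the asker's code: the Page class is reduced to a hypothetical single title field, the element names page and title are assumed, and the input is a tiny in-memory string instead of the real dump. The point is only that each page is handed to the processor and then dropped, so at most one page is held at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingPagesSketch {

    /** Hypothetical minimal Page; the real class in the question has more fields. */
    static final class Page {
        String title;

        @Override
        public String toString() {
            return "Page[" + title + "]";
        }
    }

    /** Same callback shape as the PageProcessor interface in the answer. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Hands each completed page to the processor, then drops the reference. */
    static final class PageHandler extends DefaultHandler {
        private final PageProcessor processor;
        private final StringBuilder text = new StringBuilder();
        private Page page;

        PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // start collecting text for the new element
            if (qName.equals("page")) {
                page = new Page();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (page == null) {
                return; // not inside a page element
            }
            if (qName.equals("title")) {
                page.title = text.toString().trim();
            } else if (qName.equals("page")) {
                processor.process(page);
                page = null; // nothing is retained after the callback returns
            }
        }
    }

    /** Streams the input through the handler; checked exceptions are wrapped for brevity. */
    static void load(InputStream in, PageProcessor processor) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new PageHandler(processor));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mediawiki>"
                + "<page><title>First</title></page>"
                + "<page><title>Second</title></page>"
                + "</mediawiki>";
        load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                page -> System.out.println(page));
    }
}
```

For the real file you would pass a FileInputStream (or a File, as in XMLManager) instead of the in-memory stream; the handler does not change.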
Of course, you can make your interface handle chunks of multiple records rather than just one and have the PageHandler collect pages locally in a smaller list and periodically send the list off for processing and clear the list.
Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
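The second option could look roughly like the sketch below. The BatchingPageProcessor name, the batch size, and the Consumer used as the downstream sink are all assumptions made for illustration, and Page is again a hypothetical stand-in. The buffer never holds more than batchSize pages, and a final flush sends the last partial chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingSketch {

    /** Hypothetical stand-in for the Page class from the question. */
    static final class Page {
        final String title;

        Page(String title) {
            this.title = title;
        }
    }

    interface PageProcessor {
        void process(Page page);
    }

    /** Buffers pages and forwards them downstream in chunks of at most batchSize. */
    static final class BatchingPageProcessor implements PageProcessor, AutoCloseable {
        private final int batchSize;
        private final Consumer<List<Page>> sink;
        private final List<Page> buffer;

        BatchingPageProcessor(int batchSize, Consumer<List<Page>> sink) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            this.batchSize = batchSize;
            this.sink = sink;
            this.buffer = new ArrayList<>(batchSize);
        }

        @Override
        public void process(Page page) {
            buffer.add(page);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Sends whatever is buffered; does nothing when the buffer is empty. */
        public void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
            buffer.clear();
        }

        /** Emits the final partial chunk once parsing is done. */
        @Override
        public void close() {
            flush();
        }
    }

    public static void main(String[] args) {
        try (BatchingPageProcessor processor = new BatchingPageProcessor(
                2, batch -> System.out.println("batch of " + batch.size()))) {
            for (int i = 1; i <= 5; i++) {
                processor.process(new Page("Page " + i));
            }
        }
        // prints "batch of 2", "batch of 2", "batch of 1"
    }
}
```

Keeping the buffering in the processor means the PageHandler stays unchanged, and the sink can be swapped for a bulk database insert or a queue without touching the parsing code.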
Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has its Java POJO equivalent, and handlers for a particular type are activated when the context changes to that element. I used this approach for SEPA, transaction banking and, for instance, Discogs (30 GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.
XML2J maps complexTypes to Java POJOs on the one hand, but also lets you specify the events you want to listen for. E.g.:
    account/@process = true
    account/accounts/@process = true
    account/accounts/@detach = true
The essence is in the third line. The detach setting makes sure individual accounts are not added to the accounts list, so the list never grows until memory overflows.
    class AccountType {
        private List<AccountType> accounts = new ArrayList<>();

        public void addAccount(AccountType tAccount) {
            accounts.add(tAccount);
        }
        // etc.
    }
In your code you need to implement the process method (by default, the code generator generates an empty method):
    class AccountsProcessor implements MessageProcessor {
        private static Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

        // assuming Spring data persistency here
        final String path = new ClassPathResource("spring-config.xml").getPath();
        ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
        AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);

        @Override
        public void process(XMLEvent evt, ComplexDataType data) throws ProcessorException {
            if (evt == XMLEvent.END) {
                if (data instanceof AccountType) {
                    process((AccountType) data);
                }
            }
        }

        private void process(AccountType data) {
            if (logger.isInfoEnabled()) {
                // do some logging
            }
            repo.save(data);
        }
    }
Note that XMLEvent.END marks the closing tag of an element, so when you are processing it, the element is complete. If you have to relate it (using a FK) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database and use its key to store with each of its children. In the final XMLEvent.END you would then update the parent.
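The placeholder idea can be illustrated without XML2J. The sketch below uses the plain SAX startElement/endElement callbacks as stand-ins for XMLEvent.BEGIN and XMLEvent.END, since only part of the XML2J API is shown here. FakeStore is a hypothetical in-memory substitute for the database, and the element names accounts and account and the id attribute are assumed for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ParentKeySketch {

    /** Hypothetical in-memory stand-in for a database with generated keys. */
    static final class FakeStore {
        final Map<Long, String> parentStatus = new LinkedHashMap<>();
        final Map<String, Long> childToParent = new LinkedHashMap<>();
        private long nextKey = 1;

        long insertParentPlaceholder() {
            long key = nextKey++;
            parentStatus.put(key, "PENDING");
            return key;
        }

        void insertChild(String childId, long parentKey) {
            childToParent.put(childId, parentKey);
        }

        void completeParent(long key, int childCount) {
            parentStatus.put(key, "COMPLETE(" + childCount + ")");
        }
    }

    static final class Handler extends DefaultHandler {
        private final FakeStore store;
        private long parentKey;
        private int childCount;
        private String childId;

        Handler(FakeStore store) {
            this.store = store;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("accounts")) {
                // BEGIN of the parent: create the placeholder and remember its key
                parentKey = store.insertParentPlaceholder();
                childCount = 0;
            } else if (qName.equals("account")) {
                childId = attributes.getValue("id");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("account")) {
                // END of a child: it is complete, store it with the parent's key
                store.insertChild(childId, parentKey);
                childCount++;
            } else if (qName.equals("accounts")) {
                // END of the parent: update the placeholder
                store.completeParent(parentKey, childCount);
            }
        }
    }

    /** Parses the XML string and returns the store; checked exceptions are wrapped for brevity. */
    static FakeStore load(String xml) {
        FakeStore store = new FakeStore();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new Handler(store));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return store;
    }

    public static void main(String[] args) {
        FakeStore store = load("<accounts><account id=\"a1\"/><account id=\"a2\"/></accounts>");
        System.out.println(store.childToParent); // {a1=1, a2=1}
        System.out.println(store.parentStatus);  // {1=COMPLETE(2)}
    }
}
```

Only the parent key and a counter are kept while the children stream through, so memory use does not depend on how many children the parent has.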
Note that the code generator generates everything you need. You just have to implement that method and, of course, the DB glue code.
There are samples to get you started. The code generator even generates your POM files, so you can build your project immediately after generation.
The default process method is like this:
    @Override
    public void process(XMLEvent evt, ComplexDataType data) throws ProcessorException {
        /*
         * TODO Auto-generated method stub implement your own handling here.
         * Use the runtime configuration file to determine which events are to be sent to the processor.
         */
        if (evt == XMLEvent.END) {
            data.print(ConsoleWriter.out);
        }
    }
Downloads:
- https://github.com/lolkedijkstra/xml2j-core
- https://github.com/lolkedijkstra/xml2j-gen
- https://sourceforge.net/projects/xml2j/
First run mvn clean install on the core (it has to be in the local Maven repo), then on the generator. And don't forget to set up the environment variable XML2J_HOME as per the directions in the user manual.