How to Parse Big (50 GB) XML Files in Java


Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline that passes the data on to its actual destination without ever storing it all in memory at once.

What I've sometimes done for this sort of situation is similar to the following.

Create an interface for processing a single element:

    public interface PageProcessor {
        void process(Page page);
    }

Supply an implementation of this to the PageHandler through a constructor:

    // Read.java
    public class Read {
        public static void main(String[] args) {
            XMLManager.load(new PageProcessor() {
                @Override
                public void process(Page page) {
                    // Obviously you want to do something other than just printing,
                    // but I don't know what that is...
                    System.out.println(page);
                }
            });
        }
    }

    // XMLManager.java
    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;

    public class XMLManager {
        public static void load(PageProcessor processor) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            try {
                SAXParser parser = factory.newSAXParser();
                File file = new File("pages-articles.xml");
                // The handler forwards each finished page to the processor
                PageHandler pageHandler = new PageHandler(processor);
                parser.parse(file, pageHandler);
            } catch (ParserConfigurationException | SAXException | IOException e) {
                e.printStackTrace();
            }
        }
    }

Send data to this processor instead of putting it in the list:

    public class PageHandler extends DefaultHandler {

        private final PageProcessor processor;
        private Page page;
        private StringBuilder stringBuilder;
        private boolean idSet = false;

        public PageHandler(PageProcessor processor) {
            this.processor = processor;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            // Unchanged from your implementation
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Unchanged from your implementation
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
                // Elide code not needing change
                } else if (qName.equals("page")) {
                    processor.process(page);
                    page = null;
                }
            } else {
                page = null;
            }
        }
    }
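Since the handler above leaves out the parts taken from the question's code, here is a minimal, self-contained sketch of how a complete streaming handler might look. The Page fields (id, title), the element names, and the names StreamingPageHandler and StreamingDemo are assumptions for illustration, not the asker's actual code. It keeps only the first id seen inside a page, on the guess that this is what the idSet flag was for.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical minimal Page; the real class from the question is not shown.
class Page {
    String id;
    String title;

    @Override
    public String toString() {
        return "Page{id=" + id + ", title=" + title + "}";
    }
}

// Same single-method interface as in the answer.
interface PageProcessor {
    void process(Page page);
}

class StreamingPageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private final StringBuilder text = new StringBuilder();
    private Page page; // the page currently being built, or null outside <page>

    StreamingPageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("page")) {
            page = new Page();
        }
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (page == null) {
            return;
        }
        if (qName.equals("title")) {
            page.title = text.toString().trim();
        } else if (qName.equals("id") && page.id == null) {
            page.id = text.toString().trim(); // keep the first id, ignore nested ones
        } else if (qName.equals("page")) {
            processor.process(page); // hand the page on...
            page = null;             // ...and keep no reference to it
        }
    }
}

class StreamingDemo {
    // Parses XML from a string; checked exceptions are wrapped to keep the demo short.
    static void parse(String xml, PageProcessor processor) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new StreamingPageHandler(processor));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<pages>"
                + "<page><title>A</title><id>1</id><revision><id>9</id></revision></page>"
                + "<page><title>B</title><id>2</id></page>"
                + "</pages>";
        parse(xml, page -> System.out.println(page));
    }
}
```

At any moment the handler holds at most one Page, so memory use should not grow with the size of the file.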

Of course, you can make your interface handle chunks of multiple records rather than just one. The PageHandler would then collect pages locally in a smaller list, periodically send that list off for processing, and clear it.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
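A rough sketch of that second option, assuming a hypothetical minimal Page class and a sink callback that receives each chunk (for example a batched database insert). The names BatchingPageProcessor, flush and BatchingDemo are mine, not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical minimal Page; the real class from the question is not shown.
class Page {
    final String title;

    Page(String title) {
        this.title = title;
    }
}

// Same single-method interface as in the answer.
interface PageProcessor {
    void process(Page page);
}

// Buffers pages and hands them on in chunks of at most batchSize.
class BatchingPageProcessor implements PageProcessor {
    private final int batchSize;
    private final Consumer<List<Page>> sink; // assumed destination for each chunk
    private final List<Page> buffer;

    BatchingPageProcessor(int batchSize, Consumer<List<Page>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Call once after parsing finishes so the last partial chunk is not lost.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // pass a copy, the sink may keep it
        buffer.clear();
    }
}

class BatchingDemo {
    public static void main(String[] args) {
        BatchingPageProcessor processor = new BatchingPageProcessor(
                2, batch -> System.out.println("batch of " + batch.size()));
        for (String title : new String[] {"A", "B", "C"}) {
            processor.process(new Page(title));
        }
        processor.flush(); // sends the remaining single page
    }
}
```

The PageHandler does not change at all with this design. The one thing to remember is the final flush() after parser.parse(...) returns.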


Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has its Java POJO equivalent, and the handlers for a particular type are activated when the context changes to that element. I have used this approach for SEPA, transaction banking and, for instance, Discogs (30 GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.

XML2J maps complexTypes to Java POJOs on the one hand, but on the other lets you specify the events you want to listen for. E.g.:

    account/@process = true
    account/accounts/@process = true
    account/accounts/@detach = true

The essence is in the third line. The detach setting makes sure individual accounts are not added to the accounts list, so the list cannot grow until memory runs out.

    class AccountType {
        private List<AccountType> accounts = new ArrayList<>();

        public void addAccount(AccountType tAccount) {
            accounts.add(tAccount);
        }
        // etc.
    }

In your code you need to implement the process method (by default the code generator generates an empty method):

    // Imports for the XML2J types and AccountsTypeRepo are omitted here
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.context.support.ClassPathXmlApplicationContext;
    import org.springframework.core.io.ClassPathResource;

    class AccountsProcessor implements MessageProcessor {
        private static final Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

        // assuming Spring Data persistence here
        final String path = new ClassPathResource("spring-config.xml").getPath();
        ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
        AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);

        @Override
        public void process(XMLEvent evt, ComplexDataType data)
            throws ProcessorException {
            // END means the closing tag was seen, so the object is complete
            if (evt == XMLEvent.END) {
                if (data instanceof AccountType) {
                    process((AccountType) data);
                }
            }
        }

        private void process(AccountType data) {
            if (logger.isInfoEnabled()) {
                // do some logging
            }
            repo.save(data);
        }
    }

Note that XMLEvent.END marks the closing tag of an element. So, when you are processing it, it is complete. If you have to relate it (using a FK) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database and use its key to store with each of its children. In the final XMLEvent.END you would then update the parent.
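To make the placeholder idea a bit more concrete, here is a rough sketch. It does not use the real XML2J classes: ParseEvent, ParsedItem and FakeDb are hypothetical stand-ins that only mimic the shape of the process callback and of a database with generated keys, and the element names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the BEGIN/END events (NOT the XML2J XMLEvent type).
enum ParseEvent { BEGIN, END }

// Stand-in for a parsed element (NOT the XML2J ComplexDataType).
class ParsedItem {
    final String type; // "accounts" is the parent, "account" is a child
    final String name;

    ParsedItem(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

// Hypothetical in-memory "database" that hands out generated keys.
class FakeDb {
    final Map<Long, String> parents = new LinkedHashMap<>();
    final List<String> children = new ArrayList<>();
    private long nextKey = 1;

    long insertParentPlaceholder() {
        long key = nextKey++;
        parents.put(key, "PENDING");
        return key;
    }

    void insertChild(long parentKey, String name) {
        children.add(parentKey + ":" + name); // parentKey plays the role of the FK
    }

    void completeParent(long key, String name) {
        parents.put(key, name);
    }
}

class ParentChildProcessor {
    private final FakeDb db;
    private long currentParentKey = -1; // -1 means no parent is open

    ParentChildProcessor(FakeDb db) {
        this.db = db;
    }

    void process(ParseEvent evt, ParsedItem data) {
        if (data.type.equals("accounts")) {
            if (evt == ParseEvent.BEGIN) {
                // Parent opened: create a placeholder row to obtain its key.
                currentParentKey = db.insertParentPlaceholder();
            } else {
                // Parent closed: it is complete now, so update the placeholder.
                db.completeParent(currentParentKey, data.name);
                currentParentKey = -1;
            }
        } else if (data.type.equals("account") && evt == ParseEvent.END) {
            // Child complete: store it right away with the parent's key.
            db.insertChild(currentParentKey, data.name);
        }
    }
}

class ParentChildDemo {
    public static void main(String[] args) {
        FakeDb db = new FakeDb();
        ParentChildProcessor processor = new ParentChildProcessor(db);
        processor.process(ParseEvent.BEGIN, new ParsedItem("accounts", "group-1"));
        processor.process(ParseEvent.END, new ParsedItem("account", "alice"));
        processor.process(ParseEvent.END, new ParsedItem("account", "bob"));
        processor.process(ParseEvent.END, new ParsedItem("accounts", "group-1"));
        System.out.println(db.parents + " " + db.children);
    }
}
```

Each child is written as soon as its closing tag is seen, so the parent never has to hold its children in memory.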

Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.

There are samples to get you started. The code generator even generates your POM files, so you can build your project immediately after generation.

The default process method is like this:

    @Override
    public void process(XMLEvent evt, ComplexDataType data)
        throws ProcessorException {
        /*
         *  TODO Auto-generated method stub implement your own handling here.
         *  Use the runtime configuration file to determine which events are to be sent to the processor.
         */
        if (evt == XMLEvent.END) {
            data.print(ConsoleWriter.out);
        }
    }

Downloads:

First run mvn clean install on the core (it has to be in the local Maven repo), then on the generator. And don't forget to set up the environment variable XML2J_HOME as per the directions in the user manual.