Parsing large XML documents in Java


SAX (Simple API for XML) will help you here.

Unlike the DOM parser, the SAX parser does not create an in-memory representation of the XML document, so it is faster and uses less memory. Instead, the SAX parser informs clients of the XML document structure through callbacks, that is, by invoking methods on an org.xml.sax.helpers.DefaultHandler instance provided to the parser.

Here is an example implementation:

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    DefaultHandler handler = new MyHandler();
    parser.parse("file.xml", handler);

In MyHandler, you define the actions to take when events such as the start or end of the document, or of an element, are generated.

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    class MyHandler extends DefaultHandler {

        // Called once, before any other event.
        @Override
        public void startDocument() throws SAXException {
        }

        // Called once, after the last event.
        @Override
        public void endDocument() throws SAXException {
        }

        // Called for every opening tag; attributes holds that tag's attributes.
        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) throws SAXException {
        }

        // Called for every closing tag.
        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
        }

        // To take specific actions for each chunk of character data (such as
        // adding the data to a node or buffer, or printing it to a file).
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
        }
    }
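To make the callback flow concrete, here is a minimal sketch of a filled-in handler. The class name TitleCollector, the helper parseTitles, and the `<title>` element are hypothetical, chosen only for illustration. It assumes the default, non-namespace-aware parser, so element names arrive in qName. The parser may deliver a single text node in several characters() calls, so the text is buffered and only read in endElement.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <title> element.
class TitleCollector extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        if ("title".equals(qName)) {
            inTitle = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // One text node may arrive in several chunks, so append rather than assign.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
            inTitle = false;
        }
    }

    // Parses an in-memory XML string; checked exceptions are wrapped to keep
    // the illustration short.
    static List<String> parseTitles(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TitleCollector handler = new TitleCollector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a large file you would pass a File or InputStream to parser.parse instead of a string; the handler itself stays the same.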


If you don't want to be bound by memory limits, I certainly recommend that you keep your current approach and store everything in a database.

The XML file should be parsed with a SAX parser, as everybody has recommended (including me). This way you can create one object at a time and immediately persist it to the database.
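The one-object-at-a-time idea could be sketched roughly as follows. Everything named here is an assumption for illustration: the Item class, the `<item id="...">` and `<name>` elements, and the ItemStreamHandler class are hypothetical. The actual database write is stood in for by a Consumer callback, where a real application would issue an INSERT through JDBC or an ORM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical record type; the real fields depend on your XML.
final class Item {
    final String id;
    final String name;

    Item(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Builds one Item per <item> element and hands it to the sink as soon as
// the element closes, so the handler never accumulates a list of Items.
class ItemStreamHandler extends DefaultHandler {
    private final Consumer<Item> sink; // stand-in for "persist to database"
    private final StringBuilder text = new StringBuilder();
    private String currentId;
    private String currentName;

    ItemStreamHandler(Consumer<Item> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        text.setLength(0); // discard text left over from before this tag
        if ("item".equals(qName)) {
            currentId = attributes.getValue("id");
            currentName = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            currentName = text.toString().trim();
        } else if ("item".equals(qName)) {
            sink.accept(new Item(currentId, currentName)); // persist, then forget
        }
    }

    // Convenience entry point for an in-memory document.
    static void stream(String xml, Consumer<Item> sink) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                            new ItemStreamHandler(sink));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because each Item becomes unreachable once the sink returns, memory use stays roughly constant regardless of how many items the file contains, provided the sink itself does not retain them.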

For the post-processing (resolving cross-references), you can run SELECT queries against the database and define primary keys, indexes, and so on. You can also use an ORM (EclipseLink, Hibernate) if you are comfortable with that.

Actually, I don't really recommend SQLite; it's easier to set up a MySQL server and store the data there. Later you can even reuse the XML data (if you don't delete it).


SAX can be very tricky to program. If you want a higher-level approach, you could look at streaming XSLT transformations using a recent Saxon-EE release. However, you've been too vague about the precise processing you are doing for me to know whether this will work in your particular case.