Parallel XML Parsing in Java


  1. The obvious one: create several parsers and run them in parallel on multiple threads.

  2. Take a look at the Woodstox performance page (down at the moment; try the Google cache).

  3. This can be done IF the structure of your XML is predictable, i.e. if it consists of many identical top-level elements. For instance:

    <element>
        <more>more elements</more>
    </element>
    <element>
        <other>other elements</other>
    </element>

    In this case you could write a simple splitter that searches for <element> and feeds each such part to its own parser instance. That's a simplified approach: in real life I'd use RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just that part of the file.

  4. Take a look at Aalto. It comes from the same people who created Woodstox. They are experts in this area, so don't reinvent the wheel.
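The splitter from item 3 might look roughly like the sketch below. It works on an in-memory string rather than on RandomAccessFile offsets, and it assumes that the repeated element is literally named `element` (as in the sample above), carries no attributes, never nests inside itself, and never appears inside comments or CDATA. The class `ElementSplitter` and its method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
final class ElementSplitter {
    // One factory per thread: whether a shared factory is thread-safe
    // depends on the StAX implementation in use.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Cuts the text into "<element>...</element>" chunks by plain string search.
    // Assumes no attributes, no nested <element>, no tags in comments or CDATA.
    static List<String> split(String xml) {
        final String open = "<element>";
        final String close = "</element>";
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) break;
            int end = xml.indexOf(close, start);
            if (end < 0) break; // unterminated element: stop splitting
            end += close.length();
            chunks.add(xml.substring(start, end));
            from = end;
        }
        return chunks;
    }

    // Parses one chunk (well-formed on its own) and counts its start tags.
    static int countStartElements(String chunk) throws Exception {
        XMLStreamReader reader =
                FACTORY.get().createXMLStreamReader(new StringReader(chunk));
        try {
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) count++;
            }
            return count;
        } finally {
            reader.close();
        }
    }

    // Parses all chunks on a fixed pool; results come back in document order.
    static List<Integer> parseInParallel(String xml, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String chunk : split(xml)) {
                Callable<Integer> task = () -> countStartElements(chunk);
                tasks.add(task);
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each chunk is a complete document with a single root, so an ordinary parser accepts it. The price is that anything outside the chunks (the enclosing root, namespace declarations on it) is invisible to the chunk parsers.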


I agree with Jim. I think that if you want to improve the overall performance of processing 1000 files, your plan is good, except for #3, which is irrelevant in this case. If, however, you want to speed up the parsing of a single file, you have a problem. I do not know how to split an XML file without parsing it. Each chunk will be malformed XML, and your parser will fail.

I believe that improving the overall time is good enough for you. In this case read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only 10 files, which will bring a serious performance benefit in a multi-CPU environment.
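A minimal sketch of that thread-pool setup, assuming the XML sources are files on disk and that "parsing" just means counting start tags. `ParallelFileParser`, `countElements` and `parseAll` are illustrative names, and the pool size is left as a parameter. The executor's internal work queue plays the role of the queue of XML sources.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
final class ParallelFileParser {
    // One factory per worker thread, created lazily and then reused.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Stand-in for real processing: counts the start tags in one file.
    static int countElements(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = FACTORY.get().createXMLStreamReader(in);
            try {
                int count = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) count++;
                }
                return count;
            } finally {
                reader.close(); // does not close the underlying stream
            }
        }
    }

    // Submits one task per file and collects the results in submission order.
    static Map<Path, Integer> parseAll(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Path, Future<Integer>> pending = new LinkedHashMap<>();
            for (Path file : files) {
                Callable<Integer> task = () -> countElements(file);
                pending.put(file, pool.submit(task));
            }
            Map<Path, Integer> results = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<Integer>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get()); // blocks until done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelFileParser.parseAll(files, 100)` would match the numbers above; a failure in any single file surfaces as an `ExecutionException` from `get()`.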


In addition to the existing good suggestions, there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
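For reference, a cursor-API loop can be as small as the sketch below. It collects the text of every element with a given local name and assumes those elements contain only text, since `getElementText()` throws if it meets a child element. `CursorExample` and `textOf` are names invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of any library.
final class CursorExample {
    // Created once and reused (single-threaded use assumed here).
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Returns the text content of every element whose local name matches.
    // Assumes the matching elements are text-only.
    static List<String> textOf(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // next() moves the cursor and returns an int event code;
                // no event object is allocated per step.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    out.add(reader.getElementText()); // leaves cursor on END_ELEMENT
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }
}
```

The Event API equivalent would go through `XMLEventReader.nextEvent()` and inspect an `XMLEvent` object at every step, which is where the extra overhead comes from.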

Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:

  1. Make sure you only create XMLInputFactory and XMLOutputFactory instances once.
  2. Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.

The reason I mention this is that while these make no functional difference (the code works as expected either way), they can make a big performance difference, although more so when processing smaller files.
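Both rules fit in a few lines. The sketch below keeps one input and one output factory in static fields and closes the reader and writer in `finally` blocks (neither `XMLStreamReader` nor `XMLStreamWriter` is `AutoCloseable`, so try-with-resources does not apply). `StaxCopier` is a made-up class that copies element names and text only; attributes and namespaces are deliberately ignored to keep it short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration; not part of any library.
final class StaxCopier {
    // Rule 1: factories are created once and reused for every document.
    private static final XMLInputFactory IN = XMLInputFactory.newInstance();
    private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();

    // Copies element names and character data; drops attributes and namespaces.
    static String copy(String xml) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamReader reader = IN.createXMLStreamReader(new StringReader(xml));
        try {
            XMLStreamWriter writer = OUT.createXMLStreamWriter(sink);
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            writer.writeStartElement(reader.getLocalName());
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            writer.writeCharacters(reader.getText());
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            writer.writeEndElement();
                            break;
                        default:
                            break; // comments, PIs etc. are skipped
                    }
                }
                writer.flush();
            } finally {
                writer.close(); // Rule 2: always close the writer
            }
        } finally {
            reader.close(); // Rule 2: always close the reader
        }
        return sink.toString();
    }
}
```

Note that closing a StAX reader or writer does not close the underlying stream, so real file or socket streams still need their own `close()`.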

Running multiple instances also makes sense, although usually with at most 1 thread per core. However, you will only benefit as long as your storage I/O can keep up; if the disk is the bottleneck this will not help, and it can in some cases hurt (if disk seeks compete). But it is worth a try.