Parallel XML Parsing in Java
This one is obvious: just create several parsers and run them in parallel in multiple threads.
Take a look at Woodstox Performance (down at the moment, try Google's cache).
This can be done IF the structure of your XML is predictable, that is, if it has many identical top-level elements. For instance:
<element> <more>more elements</more></element> <element> <other>other elements</other></element>
In this case you could create a simple splitter that searches for <element> and feeds each such part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start and stop points (<element>) and then create a custom FileInputStream that operates on just that part of the file.
Take a look at Aalto. It comes from the same people who created Woodstox. They are experts in this area, so don't reinvent the wheel.
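To make the splitter idea above a bit more concrete, here is a minimal in-memory sketch. It is not the RandomAccessFile version. It works on a String and assumes the tag has no attributes, is never nested inside itself, and never appears inside comments or CDATA. The class and method names (ChunkedParse, split, parseInParallel) are made up for illustration, and counting start tags merely stands in for real processing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ChunkedParse {
    // One factory per worker thread: the StAX API itself does not promise
    // that a factory may be shared between threads.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Naive splitter: cuts the text into <tag>...</tag> pieces.
    // Assumes no attributes on the tag and no nesting of the tag in itself.
    static List<String> split(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) break;
            int end = xml.indexOf(close, start);
            if (end < 0) break;
            end += close.length();
            chunks.add(xml.substring(start, end));
            from = end;
        }
        return chunks;
    }

    // Parses one chunk as a standalone document and counts its start tags.
    static int countStartElements(String chunk) throws Exception {
        XMLStreamReader reader =
                FACTORY.get().createXMLStreamReader(new StringReader(chunk));
        try {
            int n = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) n++;
            }
            return n;
        } finally {
            reader.close();
        }
    }

    // Hands every chunk to the pool and sums the per-chunk results.
    static int parseInParallel(String xml, String tag, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String chunk : split(xml, tag)) {
                results.add(pool.submit(() -> countStartElements(chunk)));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each chunk is well-formed on its own only because it starts and ends at a matching pair of tags. That is why this trick depends on the predictable structure mentioned above.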
I agree with Jim. If you want to improve the overall processing time for 1000 files, your plan is good, except for #3, which is irrelevant in this case.
If, however, you want to speed up the parsing of a single file, you have a problem. I do not know how to split an XML file without parsing it. Each chunk will be malformed XML and your parser will fail.
I believe that improving the overall time is good enough for you. In this case, read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html and then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only 10 files, which brings a serious performance benefit in a multi-CPU environment.
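A rough sketch of that pool-plus-queue setup, using only the JDK: a fixed ExecutorService already keeps its own work queue, so submitting one task per file is enough. FilePool, countElements and parseAll are hypothetical names, the thread count is left as a parameter, and counting elements stands in for whatever you actually do with each file.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class FilePool {
    // One factory per worker thread, created lazily and then reused.
    private static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(XMLInputFactory::newInstance);

    // Parses a single file and returns the number of elements in it.
    static int countElements(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = FACTORY.get().createXMLStreamReader(in);
            try {
                int n = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) n++;
                }
                return n;
            } finally {
                reader.close(); // does not close the stream; try-with-resources does
            }
        }
    }

    // Queues one task per file and collects results in submission order.
    static List<Integer> parseAll(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(pool.submit(() -> countElements(file)));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : pending) counts.add(f.get());
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 1000 files you would call something like parseAll(files, 100). Whether 100 is a sensible pool size depends on your cores and disk, so treat that number as something to measure, not a given.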
In addition to the existing good suggestions, there is one rather simple thing to do: use the cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the cursor API without adding significant overhead (at most 5-10% compared to hand-written code).
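For anyone who has not used it, a minimal cursor-API loop looks roughly like this. The class and method names are invented for the example. The point is that XMLStreamReader moves a cursor and lets you pull values from the current position, whereas XMLEventReader hands back a separate XMLEvent object for each event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CursorExample {
    // Collects the local names of all elements, in document order.
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // next() advances the cursor and returns the event type as an int
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```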
Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:
- Make sure you only create XMLInputFactory and XMLOutputFactory instances once
- Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.
The reason I mention this is that while these make no functional difference (the code works as expected either way), they can make a big performance difference, although more so when processing smaller files.
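As a sketch of the two rules above: keep the factories in static fields so they are created once, and close every reader and writer in a finally block (XMLStreamReader and XMLStreamWriter are not AutoCloseable, so try-with-resources does not apply). SharedFactories and copy are hypothetical names. The copy logic deliberately ignores attributes, namespaces and comments to stay short.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class SharedFactories {
    // Rule 1: created once, reused for every document.
    private static final XMLInputFactory IN = XMLInputFactory.newInstance();
    private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();

    // Copies elements and text from the input to a new string.
    static String copy(String xml) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamReader reader = IN.createXMLStreamReader(new StringReader(xml));
        XMLStreamWriter writer = OUT.createXMLStreamWriter(sink);
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        writer.writeCharacters(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        writer.writeEndElement();
                        break;
                    default:
                        break; // everything else is skipped in this sketch
                }
            }
            writer.flush();
        } finally {
            // Rule 2: always close, even if parsing failed.
            try {
                writer.close();
            } finally {
                reader.close();
            }
        }
        return sink.toString();
    }
}
```

Note that close() on a StAX reader or writer releases the parser's own resources but leaves the underlying stream open, so real file streams still need closing separately.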
Running multiple instances also makes sense, although usually with at most 1 thread per core. However, you will only benefit as long as your storage I/O can support such speeds; if the disk is the bottleneck, this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.