What is the fastest way to parse large XML docs in Python?

python xml performance parsing

It looks to me as if you do not need any DOM capabilities from your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.

Note, however, Fredrik's advice on using the cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

    for event, elem in iterparse(source):
        if elem.tag == "record":
            # ... process record elements ...
            elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = context.next()

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()

Note that lxml's iterparse() does not allow this.

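The previous snippet does not work on Python 3.7, because Python 3 iterators have no .next() method (the built-in next(context) replaces it). Consider the following way to get the root element instead: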

    import xml.etree.ElementTree as ET

    # Get an iterable.
    context = ET.iterparse(source, events=("start", "end"))

    for index, (event, elem) in enumerate(context):
        # Get the root element (the first event is the root's "start").
        if index == 0:
            root = elem
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()
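To make the root-clearing pattern easier to try out, here is a minimal self-contained sketch that runs on Python 3 with only the standard library. The helper name iter_record_texts, the SAMPLE document and the "record" tag are made up for illustration; adapt them to your own data.

```python
import io
import xml.etree.ElementTree as ET


def iter_record_texts(source, tag="record"):
    """Yield the text of each <tag> element without keeping them all in memory.

    `source` is a filename or a binary file object. The helper name and the
    default tag are illustrative, not part of any library.
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    # The first event is the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            # Drop already-processed children so the root does not grow.
            root.clear()


# Hypothetical sample input; a real file path works the same way.
SAMPLE = b"<root><record>a</record><record>b</record><record>c</record></root>"

texts = list(iter_record_texts(io.BytesIO(SAMPLE)))
print(texts)
```

Because the function is a generator, each record is handed to the caller before root.clear() discards it, so any data you need must be copied out of the element while you are processing it.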


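Have you tried the cElementTree module?

cElementTree is included with Python 2.5 and later, as xml.etree.cElementTree. (On Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically, and the cElementTree alias was removed in Python 3.9.) Refer to the benchmarks.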



I recommend lxml; it's a Python binding for the libxml2 library, which is really fast.

In my experience, libxml2 and expat have very similar performance. But I prefer libxml2 (and lxml for Python) because it seems to be more actively developed and tested. Also, libxml2 has more features.

lxml is mostly API compatible with xml.etree.ElementTree, and there is good documentation on its website.
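Since the two APIs largely overlap, one common way to take advantage of this is to import lxml when it is available and fall back to the standard library otherwise. The following is only a sketch: it assumes your code sticks to the shared subset of the API (fromstring, findall, get and similar), and the sample document is made up.

```python
try:
    # Assumes lxml is installed (for example via pip install lxml).
    from lxml import etree as ET
except ImportError:
    # Fall back to the standard library implementation.
    import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = b"<root><record id='1'>a</record><record id='2'>b</record></root>"

root = ET.fromstring(DOC)
ids = [rec.get("id") for rec in root.findall("record")]
print(ids)
```

Code that relies on lxml-only features, such as full XPath, will of course not run on the fallback.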
