Why is lxml.etree.iterparse() eating up all my memory?


As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of this is that elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
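For instance, a minimal sketch (reusing the placeholder file name and tag from the usage example below) shows that ancestors stay reachable mid-parse, which is also why nothing is freed automatically:

import lxml.etree

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
for event, elem in context:
    # Every ancestor is still reachable, which is why nothing can be
    # freed automatically: the whole tree is being kept alive.
    ancestors = elem.xpath('ancestor-or-self::*')
    print(elem.tag, 'has', len(ancestors) - 1, 'ancestors still in memory')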

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

which you could then use like this:

import lxml.etree

def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, and thus saves more memory. Here you'll find a script which demonstrates the difference.
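The linked script is not reproduced here, but the gist of such a comparison can be sketched with the Unix-only resource module: run it once with a plain elem.clear() loop and once with fast_iter, in separate processes, and compare the numbers. The file name and tag are the placeholders used above:

import resource
import lxml.etree

def peak_rss():
    # Peak resident set size so far; kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, lambda elem: None)
print('peak memory:', peak_rss())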


Directly copied from http://effbot.org/zone/element-iterparse.htm

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
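The same pattern carries over to lxml, which is what the question is about. A minimal sketch, assuming a record tag and the placeholder file name from above:

import lxml.etree

context = lxml.etree.iterparse('really-big-file.xml', events=('start', 'end'))
_, root = next(context)  # the first event is the start of the root element
for event, elem in context:
    if event == 'end' and elem.tag == 'record':
        # ... process the record here ...
        root.clear()  # drop everything parsed so far from the root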


This worked really well for me:

def destroy_tree(tree):
    # Record each node's depth and parent so that children can be
    # removed before their parents.
    root = tree.getroot()
    node_tracker = {root: [0, None]}

    for node in root.iterdescendants():
        parent = node.getparent()
        node_tracker[node] = [node_tracker[parent][0] + 1, parent]

    # Sort deepest-first; the root (depth 0, parent None) comes last.
    node_tracker = sorted([(depth, parent, child) for child, (depth, parent)
                           in node_tracker.items()], key=lambda x: x[0], reverse=True)

    for _, parent, child in node_tracker:
        if parent is None:
            break
        parent.remove(child)

    del tree
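For example, a usage sketch (the file name and tag are placeholders): parse normally, do your work while the full tree is available, then tear the tree down explicitly:

import lxml.etree

tree = lxml.etree.parse('really-big-file.xml')
for elem in tree.iter('schedule'):
    pass  # process each element while the full tree is available
destroy_tree(tree)  # detach every node, deepest first, so memory can be reclaimed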