Using Python Iterparse For Large XML Files

python xml lxml large-files elementtree

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context

    def process_element(elem):
        print(elem.xpath('description/text()'))

    # MYFILE is the path to the XML file being parsed
    context = etree.iterparse(MYFILE, tag='item')
    fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.

Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, because it also removes the preceding siblings of every ancestor, thus saving more memory.

    import io
    import textwrap

    import lxml.etree as ET


    def setup_ABC():
        content = textwrap.dedent('''\
          <root>
            <A1>
              <B1></B1>
              <C>1<D1></D1></C>
              <E1></E1>
            </A1>
            <A2>
              <B2></B2>
              <C>2<D></D></C>
              <E2></E2>
            </A2>
          </root>
            ''')
        # io.BytesIO needs bytes, so encode the text (required on Python 3)
        return content.encode('utf-8')


    def study_fast_iter():
        def orig_fast_iter(context, func, *args, **kwargs):
            for event, elem in context:
                print('Processing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode')))
                func(elem, *args, **kwargs)
                print('Clearing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode')))
                elem.clear()
                while elem.getprevious() is not None:
                    print('Deleting {p}'.format(
                        p=(elem.getparent()[0]).tag))
                    del elem.getparent()[0]
            del context

        def mod_fast_iter(context, func, *args, **kwargs):
            """
            http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            Author: Liza Daly
            See also http://effbot.org/zone/element-iterparse.htm
            """
            for event, elem in context:
                print('Processing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode')))
                func(elem, *args, **kwargs)
                # It's safe to call clear() here because no descendants will be
                # accessed
                print('Clearing {e}'.format(
                    e=ET.tostring(elem, encoding='unicode')))
                elem.clear()
                # Also eliminate now-empty references from the root node to elem
                for ancestor in elem.xpath('ancestor-or-self::*'):
                    print('Checking ancestor: {a}'.format(a=ancestor.tag))
                    while ancestor.getprevious() is not None:
                        print(
                            'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                        del ancestor.getparent()[0]
            del context

        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        orig_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Deleting B2

        print('-' * 80)
        # The improved fast_iter deletes A1. The original fast_iter does not.
        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        mod_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Checking ancestor: root
        # Checking ancestor: A1
        # Checking ancestor: C
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Checking ancestor: root
        # Checking ancestor: A2
        # Deleting A1
        # Checking ancestor: C
        # Deleting B2


    study_fast_iter()


iterparse() lets you process elements while the tree is still being built. That means that unless you remove what you no longer need, you will still end up with the whole tree in memory at the end.

For more information, read http://effbot.org/zone/element-iterparse.htm by the author of the original ElementTree implementation (it also applies to lxml).
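The point about ending up with the whole tree can be checked with a small experiment. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of lxml (an assumption, to keep it dependency-free), a made-up in-memory document, and a hypothetical helper named count_retained. It grabs the root from the first start event and, optionally, calls root.clear() after each item, which is one common way to discard elements that have already been processed.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document: 1000 <item> elements under a single <root>.
XML = (
    b"<root>"
    + b"".join(
        b"<item><description>d%d</description></item>" % i for i in range(1000)
    )
    + b"</root>"
)


def count_retained(clear):
    """Parse XML with iterparse and return how many children the root
    element still holds once parsing has finished."""
    context = ET.iterparse(io.BytesIO(XML), events=("start", "end"))
    _, root = next(context)  # the first event is the start of <root>
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            # ... process elem here ...
            if clear:
                # Drop every child attached to root so far.
                root.clear()
    return len(root)


kept_without_clear = count_retained(clear=False)
kept_with_clear = count_retained(clear=True)
print(kept_without_clear, kept_with_clear)
```

Without the clear() call, every item is still attached to the root when parsing ends; with it, the root ends up empty.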


Why not use the callback approach of SAX?
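For comparison, here is a minimal sketch of what the SAX callback approach could look like with the standard library's xml.sax module. The handler class name, the sample document, and the choice to collect description text are assumptions made for illustration; SAX never builds a tree, so there is nothing to clear afterwards.

```python
import io
import xml.sax


class DescriptionHandler(xml.sax.ContentHandler):
    """Hypothetical handler that gathers the text of each <description>."""

    def __init__(self):
        super().__init__()
        self.descriptions = []
        self._in_description = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "description":
            self._in_description = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so buffer them.
        if self._in_description:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "description":
            self.descriptions.append("".join(self._chunks))
            self._in_description = False


# Made-up sample input; a real run would pass a file name or file object.
sample = (
    b"<root>"
    b"<item><description>first</description></item>"
    b"<item><description>second</description></item>"
    b"</root>"
)

handler = DescriptionHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.descriptions)
```

For a genuinely large file you would handle each description inside endElement instead of appending it to a list, since the list itself grows with the input.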
