Using lxml and iterparse() to parse a big (~1 GB) XML file

python xml parsing lxml iterparse

from lxml import etree

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    # Free the record's children and text once it has been processed.
    element.clear()

The final element.clear() releases each record's contents once it has been processed, which stops you from using too much memory.

[Update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # encoding="unicode" makes tostring() return str instead of bytes.
    print(etree.tostring(element, encoding="unicode"))
    element.clear()

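for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # tostring() returns bytes by default; encoding="unicode" gives str so ''.join() works.
    print(''.join(etree.tostring(child, encoding="unicode") for child in element))
    element.clear()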

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    # child.text is None for a child without text, so substitute ''.
    print(''.join(child.text or '' for child in element))
    element.clear()


For future searchers: The top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for child in element:
        print(child.tag, child.text)
    element.clear()

^ This is not a scalable solution, especially as your source file grows. The better solution is to get the root element and clear it every time a complete record has been loaded. This will keep memory usage pretty stable (under 20 MB, I would say).

Here's a solution that doesn't require looking for a specific tag. This function returns a generator that yields every direct child node (e.g. <BlogPost> elements) of the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

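from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        # Remember the tag of the record that has just been opened.
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        # The matching end tag means the record is complete.
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            # Drop the finished record from the tree before reading the next one.
            root.clear()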
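
A quick way to try the generator is to feed it a small in-memory document. The sketch below repeats the function so it runs on its own. The sample data and the summarize() helper are made up for illustration, and the import falls back to the standard library's ElementTree (which offers the same iterparse events) only so it runs without lxml installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # stdlib fallback; same iterparse events interface
    import xml.etree.ElementTree as etree

# Hypothetical sample with two different record types under the root.
SAMPLE = b"""<Database>
  <BlogPost><Author>A1</Author><Content>C1</Content></BlogPost>
  <Comment><Author>A2</Author></Comment>
</Database>"""


def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=("start", "end"))
    _, root = next(doc)  # first event: start of the root element
    start_tag = None
    for event, element in doc:
        if event == "start" and start_tag is None:
            start_tag = element.tag
        if event == "end" and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()


def summarize(source):
    """Return (tag, [child tags]) for every record directly under the root."""
    return [(rec.tag, [child.tag for child in rec]) for rec in iterate_xml(source)]


print(summarize(BytesIO(SAMPLE)))
```

Each record is consumed before the generator resumes and clears the root, so nothing is lost. One caveat that follows from the logic: a record containing a nested element with the same tag as itself would be yielded at the inner end tag.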


I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print('Author:', post.xpath('Author')[0].text)
   ...:     print('Content:', post.xpath('Content')[0].text)
   ...:
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.

Doing it your way,

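from lxml import etree

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    for info in element.iter():
        if info.tag in ('Author', 'Content'):
            print(info.tag, ':', info.text)
    # Release the processed record so a big file does not pile up in memory.
    element.clear()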
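
On the big-file question: parse() builds the complete tree in memory before any xpath() call runs, so memory grows with the size of the file, whereas iterparse() can let go of each record once it has been handled. A minimal sketch of a middle ground, assuming every BlogPost has the Author and Content children shown in the output above: stream the records and do the relative lookups per record with findtext(). The sample data and the name iter_posts are made up for illustration, and the import falls back to the standard library's ElementTree only so the snippet runs without lxml installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:  # stdlib fallback; supports the calls used below
    import xml.etree.ElementTree as etree

# Made-up sample; pass a file path instead for a real document.
SAMPLE = b"""<Database>
  <BlogPost><Author>Last Name, Name</Author><Content>Lorem ipsum</Content></BlogPost>
  <BlogPost><Author>Other, Name</Author><Content>Dolor sit amet</Content></BlogPost>
</Database>"""


def iter_posts(source):
    """Yield (author, content) for each BlogPost, one record at a time."""
    context = etree.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root element
    for event, element in context:
        if event == "end" and element.tag == "BlogPost":
            # Relative path lookups, comparable to post.xpath('Author')[0].text
            yield element.findtext("Author"), element.findtext("Content")
            root.clear()  # discard the record that was just handled


for author, content in iter_posts(BytesIO(SAMPLE)):
    print("Author:", author)
    print("Content:", content)
```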
