
Loading huge XML files and dealing with MemoryError


Do not use BeautifulSoup to try to parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream: handle information as you are notified of elements, then delete the elements again:

from xml.etree import ElementTree as ET

parser = ET.iterparse(filename)
for event, element in parser:
    # element is a whole element
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up
        element.clear()

By using a event-driven approach, you never need to hold the whole XML document in memory, you only extract what you need and discard the rest.
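For instance, here is a minimal, self-contained sketch of that pattern that counts matching elements; the file name 'huge.xml' and the tag 'item' are placeholders for your own data:

from xml.etree import ElementTree as ET

count = 0
for event, element in ET.iterparse('huge.xml', events=('end',)):
    # an 'end' event fires once an element has been fully parsed
    if element.tag == 'item':
        count += 1       # extract whatever you need from the element here
        element.clear()  # then release its children and text from memory
print(count)

Note that element.clear() empties the element itself, but the root element still keeps a reference to each cleared child; if memory still grows on very large files, periodically clearing the root as well is a common refinement.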

See the iterparse() tutorial and documentation.

Alternatively, you can use the lxml library; it offers the same API in a faster and more feature-rich package.
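For example, a sketch of the same loop with lxml (assuming lxml is installed; the file name and tag names are again placeholders). lxml's iterparse() additionally accepts a tag argument, so you can filter elements without the explicit if test:

from lxml import etree

for event, element in etree.iterparse('huge.xml', tag='item'):
    # only <item> elements reach this loop body
    print(element.findtext('name'))  # process the element
    element.clear()                  # then free its memory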