What is a good XML stream parser for Python? [closed]
Here's a good answer about xml.etree.ElementTree.iterparse practice on huge XML files. lxml has the method as well. The key to stream parsing with iterparse is manually clearing and removing already processed nodes; otherwise you will end up running out of memory.
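To make the clearing step concrete, here is a minimal sketch with the standard library's iterparse. The helper name iter_entries and the sample data are made up for illustration, and it assumes the records of interest are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield {child tag: child text} for each <tag> element, then free it."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Remove already processed children from the root,
            # otherwise the in-memory tree keeps growing.
            root.clear()


if __name__ == "__main__":
    # Hypothetical in-memory sample standing in for a huge file.
    data = b"<root><entry><a>value 0</a></entry><entry><a>value 1</a></entry></root>"
    for entry in iter_entries(io.BytesIO(data)):
        print(entry)
```

Calling only elem.clear() would empty each processed element but still leave an empty node attached to the root, which is why the sketch clears the root instead.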
Another option is xml.sax. The official manual is too formal for me and lacks examples, so it needs clarification along with the question. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. That is to say, xml.sax.make_parser() provides a suitable stream parser.
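A quick way to check this claim is to feed the parser a document cut at arbitrary points. This is only a sketch: the TagCounter handler and the chunk boundaries are invented for illustration, and it assumes the default expat-based parser is the one make_parser() returns.

```python
import xml.sax
import xml.sax.handler
import xml.sax.xmlreader


class TagCounter(xml.sax.handler.ContentHandler):
    """Hypothetical handler that only counts opening tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()
# The default expat reader implements the incremental interface.
is_incremental = isinstance(parser, xml.sax.xmlreader.IncrementalParser)

handler = TagCounter()
parser.setContentHandler(handler)
# Chunks may split the markup anywhere, even in the middle of a tag name.
for chunk in ("<root><en", "try/><entry", "/></root>"):
    parser.feed(chunk)
# close() tells the parser that no more data is coming.
parser.close()

print(is_incremental, handler.count)
```

The handler callbacks fire as soon as enough data has arrived, so nothing beyond the current chunk and the handler's own state has to be kept in memory.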
For instance, given an XML stream like:
<?xml version="1.0" encoding="utf-8"?>
<root>
  <entry><a>value 0</a><b foo='bar' /></entry>
  <entry><a>value 1</a><b foo='baz' /></entry>
  <entry><a>value 2</a><b foo='quz' /></entry>
  ...
</root>
It can be handled in the following way.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import xml.sax


class StreamHandler(xml.sax.handler.ContentHandler):

    lastEntry = None
    lastName = None

    def startElement(self, name, attrs):
        self.lastName = name
        if name == 'entry':
            self.lastEntry = {}
        elif name != 'root':
            self.lastEntry[name] = {'attrs': attrs, 'content': ''}

    def endElement(self, name):
        if name == 'entry':
            print({
                'a': self.lastEntry['a']['content'],
                'b': self.lastEntry['b']['attrs'].getValue('foo')
            })
            self.lastEntry = None
        elif name == 'root':
            # stop as soon as the document element is closed
            raise StopIteration

    def characters(self, content):
        if self.lastEntry:
            self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
    # use default ``xml.sax.expatreader``
    parser = xml.sax.make_parser()
    parser.setContentHandler(StreamHandler())

    # feed the parser with small chunks to simulate a stream
    with open('data.xml') as f:
        while True:
            buffer = f.read(16)
            if not buffer:
                # input ended before </root> was seen; avoid looping forever
                break
            try:
                parser.feed(buffer)
            except StopIteration:
                break

    # if you can provide a file-like object it's as simple as
    with open('data.xml') as f:
        try:
            parser.parse(f)
        except StopIteration:
            # raised by the handler when </root> is reached
            pass