Parse XML file into Python object Parse XML file into Python object python python

Parse XML file into Python object


My beloved SD Chargers hat is off to you if you think a regex is easier than this:

#!/usr/bin/env pythonimport xml.etree.cElementTree as etsxml="""<encspot>  <file>   <Name>some filename.mp3</Name>   <Encoder>Gogo (after 3.0)</Encoder>   <Bitrate>131</Bitrate>  </file>  <file>   <Name>another filename.mp3</Name>   <Encoder>iTunes</Encoder>   <Bitrate>128</Bitrate>    </file></encspot>"""tree=et.fromstring(sxml)for el in tree.findall('file'):    print '-------------------'    for ch in el.getchildren():        print '{:>15}: {:<30}'.format(ch.tag, ch.text) print "\nan alternate way:"  el=tree.find('file[2]/Name')  # xpathprint '{:>15}: {:<30}'.format(el.tag, el.text)  

Output:

-------------------           Name: some filename.mp3                     Encoder: Gogo (after 3.0)                      Bitrate: 131                           -------------------           Name: another filename.mp3                  Encoder: iTunes                                Bitrate: 128                           an alternate way:           Name: another filename.mp3  

If your attraction to a regex is being terse, here is an equally incomprehensible bit of list comprehension to create a data structure:

[(ch.tag,ch.text) for e in tree.findall('file') for ch in e.getchildren()]

Which creates a list of tuples of the XML children of <file> in document order:

[('Name', 'some filename.mp3'),  ('Encoder', 'Gogo (after 3.0)'),  ('Bitrate', '131'),  ('Name', 'another filename.mp3'),  ('Encoder', 'iTunes'),  ('Bitrate', '128')]

With a few more lines and a little more thought, obviously, you can create any data structure that you want from XML with ElementTree. It is part of the Python distribution.

Edit

Code golf is on!

[{item.tag: item.text for item in ch} for ch in tree.findall('file')] [ {'Bitrate': '131',    'Name': 'some filename.mp3',    'Encoder': 'Gogo (after 3.0)'},   {'Bitrate': '128',    'Name': 'another filename.mp3',    'Encoder': 'iTunes'}]

If your XML only has the file section, you can choose your golf. If your XML has other tags, other sections, you need to account for the section the children are in and you will need to use findall

There is a tutorial on ElementTree at Effbot.org


Use ElementTree. You don't need/want to muck about with a parse-only gadget like pyexpat ... you'd only end up re-inventing ElementTree partially and poorly.

Another possibility is lxml which is a third-party package which implements the ElementTree interface plus more.

Update Someone started playing code-golf; here's my entry, which actually creates the data structure you asked for:

# xs = """<encspot> etc etc </encspot""">>> import xml.etree.cElementTree as et>>> from pprint import pprint as pp>>> pp([dict((attr.tag, attr.text) for attr in el) for el in et.fromstring(xs)])[{'Bitrate': '131',  'Encoder': 'Gogo (after 3.0)',  'Frame': 'no',  'Frames': '6255',  'Freq.': '44100',  'Length': '00:02:43',  'Mode': 'joint stereo',  'Name': 'some filename.mp3',  'Quality': 'good',  'Size': '5,236,644'}, {'Bitrate': '0', 'Name': 'foo.mp3'}]>>>

You'd probably want to have a dict mapping "attribute" names to conversion functions:

converters = {    'Frames': int,    'Size': lambda x: int(x.replace(',', '')),    # etc    }


I have also been looking for a simple way to transform data between XML documents and Python data structures, something similar to Golang's XML library which allows you to declaratively specify how to map from data structures to XML.

I was unable to find such a library for Python, so I wrote one to meet my need called declxml for declarative XML processing.

With declxml, you create processors which declaratively define the structure of your XML document. Processors are used to perform both parsing and serialization as well as a basic level of validation.

Parsing this XML data into a list of dictionaries with declxml is straightforward

import declxml as xmlxml_string = """<encspot>  <file>   <Name>some filename.mp3</Name>   <Encoder>Gogo (after 3.0)</Encoder>   <Bitrate>131</Bitrate>  </file>  <file>   <Name>another filename.mp3</Name>   <Encoder>iTunes</Encoder>   <Bitrate>128</Bitrate>    </file></encspot>"""processor = xml.dictionary('encspot', [    xml.array(xml.dictionary('file', [        xml.string('Name'),        xml.string('Encoder'),        xml.integer('Bitrate')    ]), alias='files')])xml.parse_from_string(processor, xml_string)

Which produces the following result

{'files': [  {'Bitrate': 131, 'Encoder': 'Gogo (after 3.0)', 'Name': 'some filename.mp3'},  {'Bitrate': 128, 'Encoder': 'iTunes', 'Name': 'another filename.mp3'}]}

Want to parse the data into objects instead of dictionaries? You can do that as well

import declxml as xmlclass AudioFile:    def __init__(self):        self.name = None        self.encoder = None        self.bit_rate = None    def __repr__(self):        return 'AudioFile(name={}, encoder={}, bit_rate={})'.format(            self.name, self.encoder, self.bit_rate)processor = xml.array(xml.user_object('file', AudioFile, [    xml.string('Name', alias='name'),    xml.string('Encoder', alias='encoder'),    xml.integer('Bitrate', alias='bit_rate')]), nested='encspot')xml.parse_from_string(processor, xml_string)

Which produces the output

[AudioFile(name=some filename.mp3, encoder=Gogo (after 3.0), bit_rate=131), AudioFile(name=another filename.mp3, encoder=iTunes, bit_rate=128)]