How to convert an XML file to a nice pandas DataFrame?


You can use xml.etree.ElementTree (from the Python standard library) to build a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):

import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict

xml_data = io.StringIO(u'''YOUR XML STRING HERE''')
etree = ET.parse(xml_data)  # create an ElementTree object
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
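To make the shape of the result concrete, here is a minimal sketch that runs the same generator on a small made-up document. The sample XML, its attribute names (type, language, name, KEY, web) and the SAMPLE_XML variable are assumptions for illustration only, not the asker's actual data.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample data; element and attribute names are assumptions.
SAMPLE_XML = """<author type="XXX" language="EN" name="AAA">
  <documents>
    <document KEY="doc-1" web="www.example.com">first text</document>
    <document KEY="doc-2" web="www.example.org">second text</document>
  </documents>
</author>"""


def iter_docs(author):
    """Yield one dict per <document>, merged with the <author> attributes."""
    author_attr = author.attrib
    for doc in author.iter("document"):
        doc_dict = author_attr.copy()   # author-level columns
        doc_dict.update(doc.attrib)     # document-level columns
        doc_dict["data"] = doc.text     # element text becomes its own column
        yield doc_dict


root = ET.parse(io.StringIO(SAMPLE_XML)).getroot()
doc_df = pd.DataFrame(list(iter_docs(root)))
print(doc_df)
```

Each document becomes one row, and the author attributes are repeated on every row that belongs to that author.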

If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

def iter_author(etree):
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
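As a rough end-to-end sketch of the multi-author case, the two generators can be combined like this. The library root element, the id attribute and the sample values are invented for the example, and yield from is used as a shorthand for the inner loop.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical document whose root is not an <author> element.
SAMPLE_XML = """<library>
  <author name="AAA">
    <document id="a1">first</document>
    <document id="a2">second</document>
  </author>
  <author name="BBB">
    <document id="b1">third</document>
  </author>
</library>"""


def iter_docs(author):
    """Yield one dict per <document> under a single <author>."""
    for doc in author.iter("document"):
        row = dict(author.attrib)
        row.update(doc.attrib)
        row["data"] = doc.text
        yield row


def iter_author(etree):
    """Walk every <author> in the tree and yield its document rows."""
    for author in etree.iter("author"):
        yield from iter_docs(author)


root = ET.fromstring(SAMPLE_XML)
doc_df = pd.DataFrame(list(iter_author(root)))
print(doc_df)
```

Because iter() searches the whole subtree, the same code works whether the authors sit directly under the root or deeper in the document.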

Have a look at the ElementTree tutorial provided in the xml library documentation.


Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.

import pandas as pd
import xml.etree.ElementTree as ET

xml_str = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<response>\n'
    ' <head>\n'
    '  <code>\n'
    '   200\n'
    '  </code>\n'
    ' </head>\n'
    ' <body>\n'
    '  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n'
    '  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n'
    ' </body>\n'
    '</response>'
)
etree = ET.fromstring(xml_str)
dfcols = ['id', 'name']
# Collect the rows first and build the frame once;
# DataFrame.append was removed in pandas 2.0 and was slow inside a loop.
rows = [[i.get('id'), i.get('name')] for i in etree.iter(tag='data')]
df = pd.DataFrame(rows, columns=dfcols)
df.head()
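If you want every attribute rather than a hand-picked list of columns, a possible variation is to pass each element's attrib mapping straight to the DataFrame constructor. This is only a sketch on a shortened copy of the XML above; the numeric conversion at the end is an optional extra, not part of the original answer.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened version of the response document, for illustration.
xml_str = (
    "<response><body>"
    '<data id="0" name="All Categories" type="category"/>'
    '<data id="13" name="RealEstate.com.au [H]" type="publication"/>'
    "</body></response>"
)

root = ET.fromstring(xml_str)

# Each element's .attrib is a dict, so a list of them maps directly to rows.
df = pd.DataFrame([dict(node.attrib) for node in root.iter("data")])

# Attribute values are parsed as strings; convert numeric columns explicitly.
df["id"] = pd.to_numeric(df["id"])
print(df)
```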


As of pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)

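pandas.read_xml is included in every release from 1.3 onward, so upgrading pandas is usually enough. You can also install the latest dev release of pandas with: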

pip install git+https://github.com/pandas-dev/pandas
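For reference, a minimal sketch of read_xml on a document like the one in the previous answer might look as follows. It assumes pandas 1.3 or newer. The xpath argument selects the elements that become rows, and parser="etree" is chosen here only to avoid the lxml dependency that the default parser needs; the sample XML is a shortened, made-up copy.

```python
import io

import pandas as pd

# Shortened sample document, for illustration only.
xml_str = """<response>
 <body>
  <data id="0" name="All Categories" type="category"/>
  <data id="13" name="RealEstate.com.au [H]" type="publication"/>
 </body>
</response>"""

# Wrap the string in a file-like object; each <data> element becomes one row
# and its attributes become columns.
df = pd.read_xml(io.StringIO(xml_str), xpath=".//data", parser="etree")
print(df[["id", "name"]])
```

When reading from disk, pass the file path instead of the StringIO object.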