How to convert an XML file to nice pandas dataframe?

python xml pandas dataframe parsing

You can use the xml package (from the Python standard library) to build a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):

    import pandas as pd
    import xml.etree.ElementTree as ET
    import io

    def iter_docs(author):
        author_attr = author.attrib
        for doc in author.iter('document'):
            doc_dict = author_attr.copy()
            doc_dict.update(doc.attrib)
            doc_dict['data'] = doc.text
            yield doc_dict

    xml_data = io.StringIO(u'''YOUR XML STRING HERE''')

    etree = ET.parse(xml_data)  # create an ElementTree object
    doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

If there are multiple authors in your original document or the root of your XML is not an author, then I would add the following generator:

    def iter_author(etree):
        for author in etree.iter('author'):
            for row in iter_docs(author):
                yield row

and change doc_df = pd.DataFrame(list(iter_docs(etree.getroot()))) to doc_df = pd.DataFrame(list(iter_author(etree)))
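As a rough illustration of the multi-author case, the sketch below wraps two authors in a hypothetical <authors> root. The element and attribute names in the sample XML are assumptions made up for this example.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical input: the root is <authors>, with several <author> children.
SAMPLE_XML = """<authors>
  <author name="alice">
    <document id="1">a1</document>
    <document id="2">a2</document>
  </author>
  <author name="bob">
    <document id="3">b1</document>
  </author>
</authors>"""


def iter_docs(author):
    """Yield one dict per <document> under a single <author>."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


def iter_author(etree):
    """Walk every <author> in the tree and yield its document rows."""
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row


etree = ET.parse(io.StringIO(SAMPLE_XML))
doc_df = pd.DataFrame(list(iter_author(etree)))
print(doc_df)
```

Because iter_author searches the whole tree for author elements, it works whether the root is an author or a wrapper around several of them.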

Have a look at the ElementTree tutorial provided in the xml library documentation.

Here is another way of converting XML to a pandas DataFrame. This example parses the XML from a string, but the same logic works when reading from a file.

    import pandas as pd
    import xml.etree.ElementTree as ET

    xml_str = '<?xml version="1.0" encoding="utf-8"?>\n<response>\n <head>\n  <code>\n   200\n  </code>\n </head>\n <body>\n  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>\n  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>\n </body>\n</response>'

    etree = ET.fromstring(xml_str)
    dfcols = ['id', 'name']

    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    # and build the DataFrame in one step.
    rows = [[i.get('id'), i.get('name')] for i in etree.iter(tag='data')]
    df = pd.DataFrame(rows, columns=dfcols)
    df.head()
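If every attribute is wanted rather than a fixed list of columns, one possible variation is to hand each element's attribute dict straight to the DataFrame constructor. This is a sketch on a shortened copy of the XML above, not part of the original answer; the integer conversion at the end is an optional extra step.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened copy of the sample response used above.
xml_str = """<response>
 <body>
  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
 </body>
</response>"""

root = ET.fromstring(xml_str)

# One dict per <data> element; the attribute names become the columns.
df = pd.DataFrame([dict(el.attrib) for el in root.iter('data')])

# ElementTree returns every attribute value as a string, so convert
# numeric columns explicitly if numbers are needed.
df['id'] = df['id'].astype(int)
print(df)
```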

As of pandas v1.3, you can simply use:

pandas.read_xml(path_or_file)

You can install the latest dev release of pandas with:
pip install git+https://github.com/pandas-dev/pandas
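A minimal sketch of read_xml in use, assuming pandas 1.3 or later is installed. The sample XML is a shortened copy of the response from the previous answer. The xpath and parser arguments shown are choices made for this example: parser="etree" selects the standard library parser so that lxml (the default parser) is not required.

```python
import io

import pandas as pd

# Shortened copy of the sample response from the previous answer.
xml_str = """<response>
 <body>
  <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
  <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
 </body>
</response>"""

# xpath selects the elements that become rows; their attributes become columns.
# Wrapping the string in StringIO passes it as a file-like object.
df = pd.read_xml(io.StringIO(xml_str), xpath=".//data", parser="etree")
print(df)
```

Unlike the manual ElementTree loops, read_xml infers column types, so numeric-looking attributes such as id come back as integers rather than strings.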
