How to convert an XML file to a nice pandas DataFrame?
You can easily use xml.etree.ElementTree (from the Python standard library) to convert the XML to a pandas.DataFrame. Here's what I would do (when reading from a file, replace xml_data with the name of your file or a file object):
    import io
    import xml.etree.ElementTree as ET

    import pandas as pd

    def iter_docs(author):
        # Attributes of the author element are repeated on every row.
        author_attr = author.attrib
        for doc in author.iter('document'):
            doc_dict = author_attr.copy()
            doc_dict.update(doc.attrib)
            doc_dict['data'] = doc.text
            yield doc_dict

    xml_data = io.StringIO(u'''YOUR XML STRING HERE''')

    etree = ET.parse(xml_data)  # create an ElementTree object
    doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
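To make the row structure concrete, here is a minimal sketch of what the generator yields for a small input. The sample XML, its attribute names, and its values are made up for illustration; only the author and document tag names come from the code above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input: attribute names and values are assumptions.
SAMPLE_XML = """<author type="XXX" language="EN" gender="xx">
    <documents count="2">
        <document KEY="e95a9a6c" web="www.example.com">First text</document>
        <document KEY="bc360cfb" web="www.example.org">Second text</document>
    </documents>
</author>"""


def iter_docs(author):
    """Yield one dict per <document>, merged with the author's attributes."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


root = ET.parse(io.StringIO(SAMPLE_XML)).getroot()
rows = list(iter_docs(root))
for row in rows:
    print(row)
```

Each dict becomes one row when the list is passed to pd.DataFrame, with the author attributes (type, language, gender), the document attributes (KEY, web) and the data text as columns.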
If there are multiple authors in your original document, or the root of your XML is not an author element, then I would add the following generator:
    def iter_author(etree):
        # Walk every <author> element and yield its document rows.
        for author in etree.iter('author'):
            for row in iter_docs(author):
                yield row
and change

    doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))

to

    doc_df = pd.DataFrame(list(iter_author(etree)))
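As a rough illustration of the multi-author case, the sketch below wraps two authors in a root element. The authors root tag, the name attribute, and the KEY values are hypothetical; the two generators are the ones described above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the <authors> wrapper and all attribute values are assumptions.
SAMPLE_XML = """<authors>
    <author name="alice">
        <document KEY="a1">Alice, first</document>
        <document KEY="a2">Alice, second</document>
    </author>
    <author name="bob">
        <document KEY="b1">Bob, first</document>
    </author>
</authors>"""


def iter_docs(author):
    """Yield one dict per <document>, merged with the author's attributes."""
    author_attr = author.attrib
    for doc in author.iter('document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


def iter_author(etree):
    """Yield document rows for every <author> found anywhere in the tree."""
    for author in etree.iter('author'):
        for row in iter_docs(author):
            yield row


etree = ET.parse(io.StringIO(SAMPLE_XML))
rows = list(iter_author(etree))
for row in rows:
    print(row)
```

Every row carries the attributes of its own author, so the resulting DataFrame can later be grouped or filtered by author.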
Have a look at the ElementTree tutorial provided in the xml library documentation.
Here is another way of converting XML to a pandas DataFrame. In this example I parse the XML from a string, but the same logic works when reading from a file.
    import xml.etree.ElementTree as ET

    import pandas as pd

    xml_str = '''<?xml version="1.0" encoding="utf-8"?>
    <response>
     <head>
      <code>
       200
      </code>
     </head>
     <body>
      <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
      <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
     </body>
    </response>'''

    etree = ET.fromstring(xml_str)
    dfcols = ['id', 'name']

    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    # and build the DataFrame in one step.
    rows = [[node.get('id'), node.get('name')] for node in etree.iter(tag='data')]
    df = pd.DataFrame(rows, columns=dfcols)
    df.head()
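If you want every attribute rather than a fixed list of columns, one possible variation is to copy each element's attrib mapping into a dict. This is only a sketch using the same sample data; the rows name is my own.

```python
import xml.etree.ElementTree as ET

xml_str = """<response>
  <body>
    <data id="0" name="All Categories" t="2018052600" tg="1" type="category"/>
    <data id="13" name="RealEstate.com.au [H]" t="2018052600" tg="1" type="publication"/>
  </body>
</response>"""

root = ET.fromstring(xml_str)

# One dict per <data> element; the keys are whatever attributes the element has.
rows = [dict(node.attrib) for node in root.iter('data')]
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame then gives one column per attribute (id, name, t, tg, type). All values are strings at this point, so numeric columns would still need converting.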