How to open this XML file to create dataframe in Python?

python xml pandas lxml

XML is a tree-like structure, while a Pandas DataFrame is a 2D table-like structure. So there is no automatic way to convert between the two. You have to understand the XML structure and know how you want to map its data onto a 2D table. Thus, every XML-to-DataFrame problem is different.

Your XML has 2 DataSets, each containing a number of Series. Each Series contains a number of Obs elements.

Each Series has a NAME attribute, and each Obs has OBS_STATUS, TIME_PERIOD and OBS_VALUE attributes. So perhaps it would be reasonable to create a table with NAME, OBS_STATUS, TIME_PERIOD, and OBS_VALUE columns.

I found pulling the desired data out of the XML a bit complicated, which makes me doubtful that I've found the best way to do it. But here is one way (PS. Thomas Maloney's idea of starting with the 2D table-like XLS data should be way simpler):

import lxml.etree as ETimport pandas as pdpath = 'feds200628.xml'def fast_iter(context, func, *args, **kwargs):    """    http://lxml.de/parsing.html#modifying-the-tree    Based on Liza Daly's fast_iter    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/    See also http://effbot.org/zone/element-iterparse.htm    http://stackoverflow.com/a/7171543/190597 (unutbu)    """    for event, elem in context:        func(elem, *args, **kwargs)        # It's safe to call clear() here because no descendants will be        # accessed        elem.clear()        # Also eliminate now-empty references from the root node to elem        for ancestor in elem.xpath('ancestor-or-self::*'):            while ancestor.getprevious() is not None:                del ancestor.getparent()[0]    del contextdata = list()obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']columns = ['NAME'] + obs_keysdef process_obs(elem, name):    dct = elem.attrib    # print(dct)    data.append([name] + [dct[key] for key in obs_keys])def process_series(elem):    dct = elem.attrib    # print(dct)    context = ET.iterwalk(        elem, events=('end', ),        tag='{http://www.federalreserve.gov/structure/compact/common}Obs'        )    fast_iter(context, process_obs, dct['SERIES_NAME'])def process_dataset(elem):    nsmap = elem.nsmap    # print(nsmap)    context = ET.iterwalk(        elem, events=('end', ),        tag='{{{prefix}}}Series'.format(prefix=elem.nsmap['kf'])        )    fast_iter(context, process_series)with open(path, 'rb') as f:    context = ET.iterparse(        f, events=('end', ),        tag='{http://www.federalreserve.gov/structure/compact/common}DataSet'        )    fast_iter(context, process_dataset)    df = pd.DataFrame(data, columns=columns)

yields

            NAME OBS_STATUS TIME_PERIOD   OBS_VALUE0        SVENY01          A  1961-06-14      2.98251        SVENY01          A  1961-06-15      2.99412        SVENY01          A  1961-06-16      3.00123        SVENY01          A  1961-06-19      2.99494        SVENY01          A  1961-06-20      2.98335        SVENY01          A  1961-06-21      2.99936        SVENY01          A  1961-06-22      2.9837...1029410     TAU2          A  2014-09-19  3.728967791029411     TAU2          A  2014-09-22  3.128361711029412     TAU2          A  2014-09-23  3.201465751029413     TAU2          A  2014-09-24  3.29972110

python xml pandas lxml

I would export the XLS formatted file to a CSV file (using a freely available program like Gnumeric or LibreOffice, or if you have it, Excel), and then read the CSV file into pandas. I know this is not exactly an answer to your final question, but parsing XML is an overly complicated solution to what you're trying to do.

Regarding parsing XML in Python, the lxml library is my favorite library to use. I find using the XPath query language together with an lxml parser to be the best route.

python xml pandas lxml

This code it work to transform to df this type of Excel XML file:

import pandas as pdfrom xml.sax import ContentHandler, parse# Reference https://goo.gl/KaOBG3class ExcelHandler(ContentHandler):    def __init__(self):        self.chars = [  ]        self.cells = [  ]        self.rows = [  ]        self.tables = [  ]    def characters(self, content):        self.chars.append(content)    def startElement(self, name, atts):        if name=="Cell":            self.chars = [  ]        elif name=="Row":            self.cells=[  ]        elif name=="Table":            self.rows = [  ]    def endElement(self, name):        if name=="Cell":            self.cells.append(''.join(self.chars))        elif name=="Row":            self.rows.append(self.cells)        elif name=="Table":            self.tables.append(self.rows)excelHandler = ExcelHandler()parse('feds200628.xls', excelHandler)df1 = pd.DataFrame(excelHandler.tables[0][10:], columns=excelHandler.tables[0][9])print df1.head()

I can't make comment (low reputation), but the answer of this question about "How to open Excel XML file programmatically" (with python and pandas) it should work.

CodeHunter

How to open this XML file to create dataframe in Python?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last