Pandas read_xml() method test strategies


PERFORMANCE: How do you explain iterparse being slower, even though it is often recommended for larger files since the file is parsed iteratively? Is it partly due to the `if` logic checks?

I would assume that more python code would make it slower, as the python code is evaluated every time. Have you tried a JIT compiler like pypy?

If I remove `i` and use `first_tag` only, it seems to be quite a bit faster, so yes, it is partly due to the `if` logic checks:

```python
import xml.etree.ElementTree as et
import pandas as pd

def read_xml_iterparse2(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        if el.text is not None and len(el.text.strip()) > 0:
            inner[el.tag] = el.text
    return pd.DataFrame(data)

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 33 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 23 ms per loop
```

I wasn't sure I understood the purpose of the last `if` check, but I'm also not sure why you would want to lose whitespace-only elements. Removing the last `if` consistently shaves off a little bit of time:

```python
def read_xml_iterparse3(path):
    data = []
    inner = {}
    first_tag = None
    for (ev, el) in et.iterparse(path):
        if not first_tag:
            first_tag = el.tag
        if el.tag == first_tag and len(inner) != 0:
            data.append(inner)
            inner = {}
        inner[el.tag] = el.text
    return pd.DataFrame(data)

%timeit read_xml_iterparse(path)
# 10 loops, best of 5: 34.4 ms per loop
%timeit read_xml_iterparse2(path)
# 10 loops, best of 5: 24.5 ms per loop
%timeit read_xml_iterparse3(path)
# 10 loops, best of 5: 20.9 ms per loop
```

Now, with or without those performance improvements, your iterparse version seems to produce an extra-large dataframe. Here is what appears to be a working, fast version:

```python
def read_xml_iterparse5(path):
    data = []
    inner = {}
    for (ev, el) in et.iterparse(path):
        # closing parent tags trigger a new row, and in our case their .text
        # is '\n' followed by spaces; it would be more reliable to pass
        # 'topusers' to read_xml_iterparse5 as the .tag to check
        if el.text and el.text[0] == '\n':
            # ignore the closing root tag (/stackoverflow)
            if inner:
                data.append(inner)
                inner = {}
        else:
            inner[el.tag] = el.text
    return pd.DataFrame(data)

print(read_xml_iterfind(path).shape)
# (900, 8)
print(read_xml_iterparse(path).shape)
# (7050, 8)
print(read_xml_lxml_xpath(path).shape)
# (900, 8)
print(read_xml_lxml_xsl(path).shape)
# (900, 8)
print(read_xml_iterparse5(path).shape)
# (900, 8)
%timeit read_xml_iterparse5(path)
# 10 loops, best of 5: 20.6 ms per loop
```
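Following up on the comment in that block, here is a hedged sketch of what passing the row tag explicitly might look like. The tag `'topusers'` is assumed from the question's sample XML, and `read_xml_iterparse6` is my name, not the question's:

```python
def read_xml_iterparse6(path, row_tag='topusers'):
    # hypothetical variant of read_xml_iterparse5: close out a row whenever
    # the known repeating element ends, instead of sniffing whitespace text
    data = []
    inner = {}
    for ev, el in et.iterparse(path):
        if el.tag == row_tag:
            # end of one <topusers> record: flush the accumulated fields
            if inner:
                data.append(inner)
                inner = {}
        elif el.text and el.text.strip():
            inner[el.tag] = el.text
    return pd.DataFrame(data)
```

Because `iterparse` yields `end` events by default, the row element arrives only after all of its children have been seen, so flushing on `el.tag == row_tag` is safe.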

MEMORY: Does CPU memory correlate with timings in I/O calls? XSLT and XPath 1.0 tend not to scale well with larger XML documents, as the entire file must be read into memory to be parsed.

I'm not totally sure what you mean by "I/O calls", but if your document is small enough to fit in cache, then everything will be much faster, as it won't evict many other items from the cache.
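To make the memory half of that concrete: building the full tree (which XSLT and XPath 1.0 effectively require) holds every node at once, while `iterparse` lets you discard element content as you go. A minimal sketch using `tracemalloc` as a rough peak-memory probe; the function names are mine, and the absolute numbers will vary by file and platform:

```python
import tracemalloc
import xml.etree.ElementTree as et

def parse_full_tree(path):
    # what XSLT/XPath 1.0 effectively need: the whole document in memory
    return et.parse(path)

def parse_streaming(path):
    # drop each element's children, attributes, and text once processed
    for ev, el in et.iterparse(path):
        el.clear()

def peak_mb(fn, path):
    # crude peak-memory measurement for a single call
    tracemalloc.start()
    fn(path)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

# e.g. compare peak_mb(parse_full_tree, path) vs. peak_mb(parse_streaming, path)
```

Note that cleared elements themselves stay referenced by their parent, so this bounds the content held, not the element count; periodically clearing the root is the usual next step for very large files.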

STRATEGY: Is a list of dictionaries an optimal strategy for the DataFrame() call? See these interesting answers: a generator version and an iterwalk user-defined version. Both upcast lists to a dataframe.

Lists use less memory, so depending on how many columns you have, it could make a noticeable difference. Of course, this then requires your XML tags to be in a consistent order, which they do appear to be. The DataFrame() call would also need to do less work, as it doesn't have to look up keys in the dict on every row to figure out which column is for which value. A sketch of that approach is below.
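Here is a hedged sketch of the list-of-lists variant. As above, `row_tag='topusers'` is assumed from the question's XML; it relies on every row having the same child tags in the same order, and will misalign columns if a field is ever missing:

```python
def read_xml_iterparse_rows(path, row_tag='topusers'):
    # hypothetical list-of-lists variant: column names are locked in from
    # the first row, then only the values are collected per row
    columns = None
    first_row_tags = []
    rows = []
    current = []
    for ev, el in et.iterparse(path):
        if el.tag == row_tag:
            if columns is None:
                columns = first_row_tags  # lock column order from the first row
            rows.append(current)
            current = []
        elif el.text and el.text.strip():
            if columns is None:
                first_row_tags.append(el.tag)
            current.append(el.text)
    return pd.DataFrame(rows, columns=columns)
```

The trade-off is fragility for speed: the dict version tolerates missing or reordered fields per row, while this version hands pandas plain positional rows with no per-row key lookups.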