Improve Row Append Performance On Pandas DataFrames Improve Row Append Performance On Pandas DataFrames numpy numpy

Improve Row Append Performance On Pandas DataFrames


I also used the dataframe's append function inside a loop and I was perplexed how slow it ran.

A useful example for those who are suffering, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

# the dictionary to pass to pandas dataframed = {}# a counter to use to add entries to "dict"i = 0 # Example data to loop and append to a dataframedata = [{"foo": "foo_val_1", "bar": "bar_val_1"},        {"foo": "foo_val_2", "bar": "bar_val_2"}]# the loopfor entry in data:    # add a dictionary entry to the final dictionary    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}        # increment the counter    i = i + 1# create the dataframe using 'from_dict'# important to set the 'orient' parameter to "index" to make the keys as rowsdf = DataFrame.from_dict(d, "index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html


Appending rows to lists is far more efficient than to a DataFrame.Hence you would want to

  1. append the rows to a list.
  2. Then convert it into DataFrame and
  3. set the index as required.


I think the best way to do it is, if you know the data you are going to receive, allocate before hand.

import numpy as npimport pandas as pdrandom_matrix = np.random.randn(100, 100)insert_df = pd.DataFrame(random_matrix)df = pd.DataFrame(columns=range(100), index=range(200))df.loc[range(100), df.columns] = random_matrixdf.loc[range(100, 200), df.columns] = random_matrix

This is the pattern that I think makes the most sense. append will be faster if you have a very small dataframe, but it doesn't scale.

In [1]: import numpy as np; import pandas as pdIn [2]: random_matrix = np.random.randn(100, 100)   ...: insert_df = pd.DataFrame(random_matrix)   ...: df = pd.DataFrame(np.random.randn(100, 100))In [2]: %timeit df.append(insert_df)272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)In [3]: %timeit df.loc[range(100), df.columns] = random_matrix493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)In [4]: %timeit df.loc[range(100), df.columns] = insert_df821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When we run this with a 100,000 row dataframe, we see much more dramatic results.

In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))In [2]: %timeit df.append(insert_df)17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)In [3]: %timeit df.loc[range(100), df.columns] = random_matrix465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So we can see an append is about 17 times slower than an insert with a dataframe, and 35 times slower than an insert with a numpy array.