
Using pandas .append within for loop


Every time you call append, pandas returns a copy of the original dataframe plus your new row. Each call therefore copies all N existing rows, so appending inside a loop costs O(N^2) overall. This is often called quadratic copying, and it quickly becomes very slow, especially with a lot of data.
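To make the quadratic cost concrete, here is a rough sketch in plain Python that only counts how many rows get copied. It does not touch pandas at all; the function names are made up for illustration, and the model assumes each append copies every existing row plus the new one.

```python
def rows_copied_by_repeated_append(n_rows: int) -> int:
    """Count rows copied if the whole frame is rebuilt on each of n_rows appends."""
    total = 0
    current_size = 0
    for _ in range(n_rows):
        # Assumed model: each append copies the existing rows plus the new row.
        total += current_size + 1
        current_size += 1
    return total


def rows_copied_by_single_build(n_rows: int) -> int:
    """Count rows copied if values are collected first and the frame is built once."""
    return n_rows


n = 10_000
append_cost = rows_copied_by_repeated_append(n)  # n * (n + 1) / 2 = 50005000
build_cost = rows_copied_by_single_build(n)      # 10000

print(append_cost, build_cost)
```

Under this simplified model, 10,000 appends copy about 5,000 times as many rows as a single build, which is roughly the kind of gap the timings show.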

In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.

    a_list = []
    b_list = []
    for data in my_data:
        a, b = process_data(data)
        a_list.append(a)
        b_list.append(b)
    df = pd.DataFrame({'A': a_list, 'B': b_list})
    del a_list, b_list
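A variation on the same idea, in case you have many columns: collect one dict per row and build the dataframe once at the end. This is only a sketch. The process_data function and my_data input below are hypothetical stand-ins, since the real ones are not shown in the question.

```python
import pandas as pd


def process_data(item):
    # Hypothetical stand-in for the question's process_data; returns two values.
    return item, item * 2


my_data = range(5)  # hypothetical input

rows = []
for item in my_data:
    a, b = process_data(item)
    # Appending to a Python list is cheap and does not copy earlier rows.
    rows.append({'A': a, 'B': b})

# Build the dataframe once, from the list of row dicts.
df = pd.DataFrame(rows)
print(df)
```

Whether you keep one list per column or one dict per row is mostly a matter of taste; the important part is that the dataframe is constructed a single time.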

Timings

    %%timeit
    data = pd.DataFrame([])
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    1 loops, best of 3: 6.8 s per loop

    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data = pd.DataFrame({'A': a_list, 'B': b_list})

    100 loops, best of 3: 8.54 ms per loop


You need to assign the result of append back to the variable data. Unlike the append method on a Python list, the pandas append does not modify the dataframe in place; it returns a new one.

    import pandas as pd
    import numpy as np

    data = pd.DataFrame([])
    for i in np.arange(0, 4):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    print(data.head())

       A    B
    0  0  1.0
    1  1  NaN
    2  2  3.0
    3  3  NaN

NOTE: This answer addresses the question as it was posed. It is not, however, the best strategy for combining large numbers of dataframes. For a more efficient solution, see Alexander's answer, which collects the values in lists and builds the dataframe once.
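One more caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the loop above raises an AttributeError. A minimal sketch of the same loop using pd.concat, assuming you are happy to collect the small frames in a list first:

```python
import pandas as pd

frames = []
for i in range(4):
    if i % 2 == 0:
        frames.append(pd.DataFrame({'A': [i], 'B': [i + 1]}))
    else:
        # No 'B' value for odd i; concat fills the gap with NaN.
        frames.append(pd.DataFrame({'A': [i]}))

# pd.concat also returns a new object, so the result must be assigned.
data = pd.concat(frames, ignore_index=True)
print(data)
```

Calling pd.concat once on the whole list also avoids the repeated copying that happens when the dataframe is grown one row at a time.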


You can build your dataframe without a loop:

    n = 4
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1
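If you prefer to compute the column in a single expression, np.where is one possible alternative to the mask-and-assign steps above. This is a sketch of that substitute technique, not the code that was timed in this answer.

```python
import numpy as np
import pandas as pd

n = 4
a = np.arange(n)

# Where A is even take A + 1, otherwise NaN; mixing ints with NaN yields floats.
data = pd.DataFrame({'A': a, 'B': np.where(a % 2 == 0, a + 1, np.nan)})
print(data)
```

Both versions avoid the Python-level loop entirely, which is where most of the time goes.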

For:

n = 10000

This is a bit faster:

    %%timeit
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1

    100 loops, best of 3: 3.3 ms per loop

vs.

    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(n):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data1 = pd.DataFrame({'A': a_list, 'B': b_list})

    100 loops, best of 3: 12.4 ms per loop
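The %%timeit magic only works in IPython or Jupyter. If you want to repeat the comparison in a plain script, something like the following sketch with the standard timeit module should do. The function names are made up for illustration, and the numbers you get will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

n = 10000


def build_with_lists():
    # Collect values in Python lists, then construct the dataframe once.
    a_list = []
    b_list = []
    for i in range(n):
        a_list.append(i)
        b_list.append(i + 1 if i % 2 == 0 else None)
    return pd.DataFrame({'A': a_list, 'B': b_list})


def build_vectorized():
    # No Python loop: fill B with NaN, then overwrite the rows where A is even.
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1
    return data


# Average seconds per call over a few repetitions.
repeats = 5
list_seconds = timeit.timeit(build_with_lists, number=repeats) / repeats
vectorized_seconds = timeit.timeit(build_vectorized, number=repeats) / repeats

print(f"lists:      {list_seconds * 1000:.2f} ms per call")
print(f"vectorized: {vectorized_seconds * 1000:.2f} ms per call")
```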