Using pandas .append within for loop
Every time you call append, pandas returns a copy of the original DataFrame plus your new row. This is called quadratic copy: building N rows this way is an O(N^2) operation, so it quickly becomes very slow, especially with a lot of data.
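To make the quadratic cost concrete, here is a rough sketch that only counts rows copied. It assumes the k-th append copies the k - 1 existing rows plus the new one, which is a simplification of what pandas actually does; both function names are made up for illustration.

```python
def rows_copied_by_repeated_append(n_rows: int) -> int:
    """Count rows copied when a frame is grown one row at a time.

    Assumption: the k-th append copies the k - 1 existing rows plus the new row.
    """
    return sum(range(1, n_rows + 1))  # 1 + 2 + ... + n = n * (n + 1) / 2


def rows_copied_by_single_build(n_rows: int) -> int:
    """Building the frame once from lists touches each row a single time."""
    return n_rows


n = 10_000
print(rows_copied_by_repeated_append(n))  # 50005000
print(rows_copied_by_single_build(n))     # 10000
```

Under that assumption, 10,000 appends copy about 50 million rows, versus 10,000 for a single constructor call.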
In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.

    a_list = []
    b_list = []

    for data in my_data:
        a, b = process_data(data)
        a_list.append(a)
        b_list.append(b)

    df = pd.DataFrame({'A': a_list, 'B': b_list})
    del a_list, b_list

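A variation on the same idea, sketched here as one possible option rather than the only way to do it: collect each row as a dict and call the constructor once. Keys missing from a row come out as NaN, which suits the case where some rows have no B value. The loop over `range(4)` is a stand-in for your real data.

```python
import pandas as pd

rows = []
for i in range(4):  # stand-in for iterating over the real data
    row = {"A": i}
    if i % 2 == 0:
        row["B"] = i + 1  # only even rows get a B value
    rows.append(row)

# One constructor call; keys missing from a row become NaN.
df = pd.DataFrame(rows)
print(df)
```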
Timings

    %%timeit
    # DataFrame.append needs pandas < 2.0 (it was removed in 2.0)
    data = pd.DataFrame([])
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    1 loops, best of 3: 6.8 s per loop

    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data = pd.DataFrame({'A': a_list, 'B': b_list})

    100 loops, best of 3: 8.54 ms per loop

You need to assign the result back to the variable data. Unlike the append method on a Python list, the pandas append does not happen in place; it returns a new DataFrame.

    import pandas as pd
    import numpy as np

    # DataFrame.append needs pandas < 2.0 (it was removed in 2.0)
    data = pd.DataFrame([])
    for i in np.arange(0, 4):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

    print(data.head())

       A    B
    0  0  1.0
    1  1  NaN
    2  2  3.0
    3  3  NaN

NOTE: This answer aims to answer the question as it was posed. It is not, however, the best strategy for combining large numbers of dataframes. For a more efficient solution, have a look at Alexander's answer.
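If you really do have many small dataframes to combine, one common pattern is to collect them in a list and call pd.concat once at the end. This also matters on recent pandas, since DataFrame.append was deprecated in 1.4 and removed in 2.0. A minimal sketch, with the loop body as a stand-in for whatever produces your frames:

```python
import pandas as pd

frames = []
for i in range(4):  # stand-in for whatever produces the individual frames
    if i % 2 == 0:
        frames.append(pd.DataFrame({"A": i, "B": i + 1}, index=[0]))
    else:
        frames.append(pd.DataFrame({"A": i}, index=[0]))

# A single concat copies each frame once instead of re-copying on every step.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```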
You can build your dataframe without a loop:

    n = 4
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1

For:

    n = 10000

This is a bit faster:

    %%timeit
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1

    100 loops, best of 3: 3.3 ms per loop

vs.

    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(n):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data1 = pd.DataFrame({'A': a_list, 'B': b_list})

    100 loops, best of 3: 12.4 ms per loop

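If you want to convince yourself that the two approaches build the same frame, a quick check along these lines should do. The names `vectorized` and `looped` are just illustrative, and dtypes are not compared because the integer width can differ by platform.

```python
import numpy as np
import pandas as pd

n = 10

# Loop-free version: fill B with NaN, then overwrite the even rows.
vectorized = pd.DataFrame({"A": np.arange(n)})
vectorized["B"] = np.nan
even = vectorized["A"] % 2 == 0
vectorized.loc[even, "B"] = vectorized["A"] + 1

# List-based version: None marks the rows without a B value.
a_list, b_list = [], []
for i in range(n):
    a_list.append(i)
    b_list.append(i + 1 if i % 2 == 0 else None)
looped = pd.DataFrame({"A": a_list, "B": b_list})

# Raises AssertionError if the values differ.
pd.testing.assert_frame_equal(vectorized, looped, check_dtype=False)
```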