Creating an empty Pandas DataFrame, then filling it?
NEVER grow a DataFrame!
TL;DR (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work; pd.DataFrame accepts both.
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
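The list-of-dicts variant mentioned above can be sketched as follows. The generator here is a hypothetical stand-in for some_function_that_yields_data, and the values it yields are made up for illustration.

```python
import pandas as pd

def some_function_that_yields_data():
    # Hypothetical stand-in for the real data source.
    yield 1, 12.3, 'xyz'
    yield 2, 45.6, 'abc'

# Collect one dict per row; the dict keys become the column names.
records = []
for a, b, c in some_function_that_yields_data():
    records.append({'A': a, 'B': b, 'C': c})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(records)
```

With dicts there is no need to pass columns=, since the names come from the keys.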
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append to, and remove from (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
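A quick way to check the last two points yourself, as a minimal sketch with made-up sample rows:

```python
import pandas as pd

# Made-up rows: an int, a float and a string per row.
data = [[1, 12.3, 'xyz'], [2, 45.6, 'abc']]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

print(df.dtypes)  # A is inferred as an integer dtype, B as a float dtype
print(df.index)   # a RangeIndex is created without being asked for
```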
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
But what if my function returns smaller DataFrames that I need to combine into one large DataFrame?
That's fine, you can still do this in linear time by growing or creating a Python list of smaller DataFrames, then calling pd.concat.
small_dfs = []
for small_df in some_function_that_yields_dataframes():
    small_dfs.append(small_df)

large_df = pd.concat(small_dfs, ignore_index=True)
or, more concisely:
large_df = pd.concat(
    list(some_function_that_yields_dataframes()), ignore_index=True)
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    # note: DataFrame.append was removed in pandas 2.0
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
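To see where the quadratic cost comes from, here is a back-of-the-envelope sketch. It uses a simplified cost model (an assumption, not a measurement of pandas internals): each append copies every existing row plus the new one into a fresh frame, while the list approach copies each row once at the end.

```python
def rows_copied_when_growing(n_rows):
    """Rows copied if every append rebuilds the whole frame (simplified model)."""
    copied = 0
    size = 0
    for _ in range(n_rows):
        copied += size + 1  # existing rows plus the new one go into a fresh frame
        size += 1
    return copied

def rows_copied_with_list(n_rows):
    """Rows copied when rows are collected first and the frame is built once."""
    return n_rows

for n in (10, 100, 1000):
    print(n, rows_copied_when_growing(n), rows_copied_with_list(n))
```

Under this model the growing approach copies n(n+1)/2 rows in total, so ten times as many rows means roughly a hundred times as much copying.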
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)

df.dtypes
A     object   # yuck!
B    float64
C     object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A      int64
B    float64
C     object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the memory you need, so the memory is re-grown each time you create a new row. It's just as bad as append, and even uglier.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df

     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A    object  # you DON'T want this
B    object
C    object
dtype: object
Filling it row by row still has all the same issues as the methods above.
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
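One rough way to run such a timing yourself is sketched below. The data generator, the function names and the row counts are all assumptions chosen for illustration, and the numbers you get will depend on your machine and pandas version. The concat variant wraps each row in a one-row DataFrame, because DataFrame.append no longer exists in pandas 2.0 and later.

```python
import timeit
import pandas as pd

COLUMNS = ['A', 'B', 'C']

def some_function_that_yields_data(n):
    # Hypothetical data source: n rows of (int, float, str).
    for i in range(n):
        yield i, i * 0.5, str(i)

def build_from_list(n):
    # Recommended approach: accumulate rows in a list, build the frame once.
    data = [[a, b, c] for a, b, c in some_function_that_yields_data(n)]
    return pd.DataFrame(data, columns=COLUMNS)

def grow_with_loc(n):
    # Grow an initially empty frame one row at a time with loc.
    df = pd.DataFrame(columns=COLUMNS)
    for a, b, c in some_function_that_yields_data(n):
        df.loc[len(df)] = [a, b, c]
    return df

def grow_with_concat(n):
    # Grow an initially empty frame by concatenating one-row frames in a loop.
    df = pd.DataFrame(columns=COLUMNS)
    for a, b, c in some_function_that_yields_data(n):
        row = pd.DataFrame([[a, b, c]], columns=COLUMNS)
        df = pd.concat([df, row], ignore_index=True)
    return df

for n in (100, 500):
    for func in (build_from_list, grow_with_loc, grow_with_concat):
        seconds = timeit.timeit(lambda: func(n), number=3)
        print(f"{func.__name__:>18} n={n:<4} {seconds:.4f}s")
```

Increasing n makes the gap between the list approach and the two growing approaches easier to see.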
Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np

todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
To do these types of calculations on the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)

In [11]: df
Out[11]:
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
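newDF = pd.DataFrame() # creates a new dataframe that's empty
# note: DataFrame.append was removed in pandas 2.0; there, use
# pd.concat([newDF, oldDF], ignore_index=True) instead
newDF = newDF.append(oldDF, ignore_index=True) # ignoring index is optional
# try printing some data from newDF
print(newDF.head()) # again optional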
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
If I have to keep appending new data to this newDF from more than one oldDF, I just use a for loop to iterate over pandas.DataFrame.append().