
How to concatenate multiple pandas.DataFrames without running into MemoryError


The problem, as noted in the other answers, is one of memory. And a solution is to store the data on disk, then build a single dataframe.

With such huge data, performance is an issue.

csv solutions are very slow, since conversion to text mode occurs. HDF5 solutions are shorter, more elegant and faster, since they use binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.

Here is the code:

import numpy as np
import pandas as pd
import os
import pickle

# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))

# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all

Better solutions:

def byHDF(dfs):
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df', df, data_columns=list('0123'))
    #del dfs
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        for df in dfs:
            pickle.dump(df, f)
            c.append(len(df))
        #del dfs
    with open('df_all.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        df_all = df_all.append(pd.DataFrame(np.empty(sum(c[1:])*4).reshape(-1, 4)))

        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset+size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all

For homogeneous dataframes, we can do even better:

def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    #del dfs
    with open('df_all.bin', 'rb') as f:
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all

And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course all of that must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle, along the lines sketched below.
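A minimal sketch of such a split, assuming a hypothetical split_rows helper and a chunk_rows value that you would pick from the memory you actually have free:

# hypothetical helper: split one oversized dataframe into row chunks
# so that each piece fits into the remaining memory before calling bypickle
def split_rows(df, chunk_rows=10**5):
    return [df.iloc[i:i + chunk_rows] for i in range(0, len(df), chunk_rows)]

# e.g. feed the pieces of df3 together with the other dataframes:
# df_all = bypickle([df1, df2] + split_rows(df3))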

I can edit this if you give more information on your data structure and size. Beautiful question!


I advise you to put your dataframes into a single csv file by concatenation, then read that csv file back.

Execute this:

# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv (header=False so the column names are not repeated)
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)
# free memory
del df1, df2, df3
# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

If this solution isn't performant enough, or you need to concatenate larger files than usual, do:

df1.to_csv('file.csv', index=False)
# write the others without a header, so they can be appended as raw rows
df2.to_csv('file1.csv', header=False, index=False)
df3.to_csv('file2.csv', header=False, index=False)
del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the csv files in python:

def concat(file1, file2):
    with open(file2, 'r') as filename2:
        data = filename2.read()
    with open(file1, 'a') as filename1:
        filename1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Then read it back:

df = pd.read_csv('file.csv')


Kinda taking a guess here, but maybe:

df1 = pd.concat([df1, df2])
del df2
df1 = pd.concat([df1, df3])
del df3

Obviously, you could do that in a loop (see the sketch below), but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
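A minimal sketch of that loop, assuming the dataframes are already collected in a list so each one can be released as soon as it has been merged:

# assume dfs = [df1, df2, df3, ...] and no other names still point at them
combined = dfs.pop(0)
while dfs:
    # pop removes the next dataframe from the list, so after it has been
    # concatenated its last reference disappears and its memory can be freed
    combined = pd.concat([combined, dfs.pop(0)])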

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo' + str(i) + '.csv')
    concat_df = pd.concat([concat_df, temp_df])

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.
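If the individual files are small enough to all fit in memory at once, a variant sketch is to hand pd.concat a generator in a single call, which avoids repeatedly copying the growing concat_df; this only helps when all the small pieces together still fit comfortably in memory:

import pandas as pd

# same hypothetical foo0.csv .. foo2.csv files as above
df_all = pd.concat(
    (pd.read_csv('foo' + str(i) + '.csv') for i in range(3)),
    ignore_index=True,
)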