Python Pandas: Convert 2,000,000 DataFrame rows to Binary Matrix (pd.get_dummies()) without memory error?
If you are doing something like one-hot encoding, or in any case are going to have lots of zeros, have you considered using sparse matrices? The conversion should be done after the pre-processing, e.g.:
    from scipy import sparse

    x, y = preprocess_data(df_chunk)
    x = sparse.csr_matrix(x.values)  # keep only the non-zero entries
    super_x.append(x)
pandas also has a sparse type:
    x, y = preprocess_data(df_chunk)
    x = x.to_sparse()  # removed in pandas 1.0; newer versions use x.astype(pd.SparseDtype(...))
    super_x.append(x)
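For newer pandas versions, where `to_sparse()` is no longer available, a minimal sketch of the same idea might look like the following. The tiny frame and its column names are made up for illustration, and a fill value of 0 is assumed because the data is a 0/1 matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an already one-hot encoded chunk.
dense = pd.DataFrame({"a": [1, 0, 0, 0], "b": [0, 0, 1, 0]}, dtype=np.uint8)

# Store only the entries that differ from the fill value (0 here).
sparse_df = dense.astype(pd.SparseDtype(np.uint8, fill_value=0))

# Fraction of entries that are actually stored.
density = sparse_df.sparse.density

# Hand over to SciPy as CSR if the downstream code expects a sparse matrix.
csr = sparse_df.sparse.to_coo().tocsr()
```

The `.sparse` accessor only works when every column has a sparse dtype, which is why the whole frame is converted in one go.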
One note: since you are cutting and joining by row, csr is preferable to csc.
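Putting the pieces together, here is a rough sketch of a chunked one-hot encoding that stacks CSR blocks by row. It does not use the `preprocess_data` function from the question. Instead it assumes a single hypothetical column `color` whose full category list is known up front, so that every chunk produces the same columns.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical data and category list, for illustration only.
categories = ["a", "b", "c"]
df = pd.DataFrame({"color": ["a", "b", "a", "c", "b", "a"]})
cat_dtype = pd.CategoricalDtype(categories=categories)

chunk_size = 2  # tiny on purpose; use something much larger for real data
blocks = []
for start in range(0, len(df), chunk_size):
    chunk = df["color"].iloc[start:start + chunk_size].astype(cat_dtype)
    # A fixed categorical dtype gives one column per category,
    # even for categories that do not occur in this chunk.
    dummies = pd.get_dummies(chunk, dtype=np.uint8)
    blocks.append(sparse.csr_matrix(dummies.values))

# Row-wise stacking is cheap for CSR blocks.
super_x = sparse.vstack(blocks, format="csr")
```

Only one chunk is ever dense at a time, so peak memory is set by `chunk_size` rather than by the 2,000,000 rows.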