Python Pandas: Convert 2,000,000 DataFrame rows to Binary Matrix (pd.get_dummies()) without memory error?


If you are doing something like one-hot encoding, or will otherwise end up with lots of zeros, have you considered using sparse matrices? The conversion should happen after the pre-processing, e.g.:

    from scipy import sparse

    x, y = preprocess_data(df_chunk)
    x = sparse.csr_matrix(x.values)  # keep only the non-zero entries
    super_x.append(x)

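pandas also has a sparse type. Note that DataFrame.to_sparse() only exists in pandas versions before 1.0; on newer versions use x.astype(pd.SparseDtype(...)) or pd.get_dummies(..., sparse=True) instead:

    x, y = preprocess_data(df_chunk)
    x = x.to_sparse()  # pandas < 1.0 only
    super_x.append(x)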
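For newer pandas versions, the same idea can be sketched with the sparse dtype directly. This is a minimal illustration, not the asker's actual pipeline: the toy `df_chunk` and its `color` column are made up, and `uint8` is just one reasonable choice of dtype for 0/1 indicators.

```python
import pandas as pd

# Toy data standing in for one chunk (hypothetical column name).
df_chunk = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# sparse=True makes get_dummies return sparse-backed indicator columns.
x = pd.get_dummies(df_chunk, columns=["color"], sparse=True, dtype="uint8")

# An already dense frame can be converted with a SparseDtype instead.
dense = pd.get_dummies(df_chunk, columns=["color"], dtype="uint8")
x_from_dense = dense.astype(pd.SparseDtype("uint8", 0))

# The .sparse accessor reports the fraction of stored (non-fill) values
# and can export the frame as a SciPy COO matrix (requires scipy).
density = x.sparse.density
coo = x.sparse.to_coo()
```

The lower the density, the more memory the sparse representation saves relative to the dense one.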

One note: since you are cutting and joining by row, CSR is preferable to CSC. CSR stores the data row by row, so row slicing and vertical stacking are cheap.
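Putting the pieces together, a chunked version might look roughly like the sketch below. Everything in it is illustrative: `preprocess_data` here is a hypothetical stand-in for your own function, the `label` and `target` columns and the `CATEGORIES` list are invented, and the tiny `chunk_size` is only there to make the example easy to follow. Fixing the category list up front is one way to make sure every chunk produces the same columns.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed to be known in advance so all chunks share the same columns.
CATEGORIES = ["a", "b", "c"]

def preprocess_data(df_chunk):
    """Hypothetical stand-in: one-hot encode 'label', return (x, y)."""
    labels = df_chunk["label"].astype(pd.CategoricalDtype(categories=CATEGORIES))
    x = pd.get_dummies(labels, dtype=np.uint8)
    y = df_chunk["target"].to_numpy()
    return x, y

df = pd.DataFrame({
    "label": ["a", "c", "b", "a", "c", "c"],
    "target": [0, 1, 0, 1, 1, 0],
})

super_x, super_y = [], []
chunk_size = 2  # tiny on purpose; use something much larger in practice
for start in range(0, len(df), chunk_size):
    df_chunk = df.iloc[start:start + chunk_size]
    x, y = preprocess_data(df_chunk)
    # Convert each chunk to CSR right away so the dense copy can be freed.
    super_x.append(sparse.csr_matrix(x.to_numpy()))
    super_y.append(y)

# Stack the row blocks into one CSR matrix and join the targets.
X = sparse.vstack(super_x, format="csr")
Y = np.concatenate(super_y)
```

Only one dense chunk is alive at a time, so peak memory is driven by `chunk_size` rather than by the full 2,000,000 rows.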