How to input large data into python pandas using looping or parallel computing?


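You can split the DataFrame into parts and process them in parallel with multiprocessing, writing each processed part to its own file. This assumes df has already been loaded, for example with pd.read_csv: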
import numpy as np
from multiprocessing import Pool

def processor(df):
    # Some work on the chunk, e.g. sorting by the 'id' column
    df.sort_values('id', inplace=True)
    return df

# df is the full DataFrame, loaded beforehand; split it into 8 parts
size = 8
df_split = np.array_split(df, size)

cores = 8
pool = Pool(cores)
# Process the parts in parallel and write each result to its own file
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
    frame.to_csv('{}'.format(n))
pool.close()
pool.join()


Use the chunksize parameter to read one chunk at a time and save each chunk to disk. This will split the original file into equal parts of 100,000 rows each:

file = "./data.csv"chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize = 100000)for it, chunk in enumerate(chunks):    chunk.to_csv('chunk_{}.csv'.format(it), sep="/") 

If you know the number of rows in the original file, you can calculate the exact chunksize to split it into 8 roughly equal parts (nrows / 8).
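A minimal sketch of that calculation, assuming total_rows is a row count you have obtained beforehand (the value below is a placeholder):

import math
import pandas as pd

total_rows = 800000          # placeholder: number of data rows, counted beforehand
parts = 8
chunk_size = math.ceil(total_rows / parts)

chunks = pd.read_csv("./data.csv", sep="/", header=0, dtype=str, chunksize=chunk_size)
for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/")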


pandas read_csv has two arguments that you could use to do what you want, as shown in the example below:

nrows : to specify the number of rows you want to read
skiprows : to specify the rows to skip at the start of the file (i.e. where reading begins)

Refer to documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
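For example, to read the second block of 100,000 rows while keeping the header row (a minimal sketch; the file name, separator, and block size are assumptions carried over from the other answers):

import pandas as pd

rows_per_part = 100000

# Skip data rows 1..100000 (row 0 is the header) and read the next 100000 rows.
part = pd.read_csv(
    "./data.csv",
    sep="/",
    header=0,
    dtype=str,
    skiprows=range(1, rows_per_part + 1),
    nrows=rows_per_part,
)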