How to input large data into python pandas using looping or parallel computing?
You can split the frame and let multiprocessing.Pool process the parts in parallel:

    import numpy as np
    import pandas as pd
    from multiprocessing import Pool

    def processor(df):
        # Some work
        df.sort_values('id', inplace=True)
        return df

    if __name__ == '__main__':
        # The original snippet assumes df is already loaded, e.g.:
        df = pd.read_csv('./data.csv')

        # Split the frame into 8 parts and hand them to 8 worker processes;
        # each processed part is written out as its own file.
        size = 8
        df_split = np.array_split(df, size)

        cores = 8
        pool = Pool(cores)
        for n, frame in enumerate(pool.imap(processor, df_split), start=1):
            frame.to_csv('{}'.format(n))
        pool.close()
        pool.join()
Use the chunksize parameter to read one chunk at a time and save each chunk to disk. This will split the original file into equal parts of 100,000 rows each:
file = "./data.csv"chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize = 100000)for it, chunk in enumerate(chunks): chunk.to_csv('chunk_{}.csv'.format(it), sep="/")
If you know the number of rows in the original file, you can calculate the exact chunksize needed to split it into 8 equal parts (nrows/8).
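For example, a minimal sketch of that calculation; counting the rows by scanning the file once and the ceiling division are my assumptions, not part of the original answer:

    import pandas as pd

    file = "./data.csv"

    # Count data rows once (minus 1 for the header line).
    with open(file) as f:
        nrows = sum(1 for _ in f) - 1

    # Ceiling division so 8 chunks cover every row even when
    # nrows is not an exact multiple of 8.
    parts = 8
    chunksize = -(-nrows // parts)

    chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)
    for it, chunk in enumerate(chunks):
        chunk.to_csv('chunk_{}.csv'.format(it), sep="/")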
pandas read_csv has two arguments that let you do exactly this:

nrows : the number of rows you want to read
skiprows : the rows to skip before reading (either a count from the start of the file or a list of row indices)
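For instance, a sketch of reading one 100,000-row slice out of the middle of the file; the start offset, the separator, and the range-based skiprows idiom are illustrative assumptions, not from the original answer:

    import pandas as pd

    file = "./data.csv"

    # Read rows 200000-299999 only: keep row 0 (the header), skip the
    # first 200000 data rows, then stop after the next 100000 rows.
    start = 200000
    n = 100000
    part = pd.read_csv(file, sep="/", dtype=str,
                       skiprows=range(1, start + 1), nrows=n)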
Refer to the documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html