Pandas read_csv speed up
Let's test it!
Data generation:
```python
import numpy as np
import pandas as pd

sz = 10**3
df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1', 'i2'])
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))

# generate 10 string columns
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))

# blow it up to 1M rows and shuffle
df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
df = df.set_index(df.pop('date').sort_values())
```
We've generated the following DataFrame:
```
In [59]: df
Out[59]:
                         i1      i2                 dt2        f1  ...          s7          s8          s9         s10
date                                                               ...
2000-01-01 00:00:00  216625    4179 1980-01-04 04:35:24  0.679989  ...  7G8rLnoocA  E7Ot7oPsJ6  puQamLn0I2  zxHrATQn0m
2000-01-01 00:00:00  374740  967991 1980-01-09 11:07:48  0.202064  ...  wLETO2g8uL  MhtzNLPXCH  PW1uKxY0df  wTakdCe6nK
2000-01-01 00:00:00  152181  627451 1980-01-10 11:49:39  0.956117  ...  mXOsfUPqOy  6IIst7UFDT  nL6XZxrT3r  BxPCFNdZTK
2000-01-01 00:00:00  915732  730737 1980-01-06 10:25:30  0.854145  ...  Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:00:00  590262  248378 1980-01-06 11:48:45  0.307373  ...  wRnMPxeopd  JF24uTUwJC  2CRrs9yB2N  hxYrXFnT1H
2000-01-01 00:00:00  161183  620876 1980-01-08 21:48:36  0.207536  ...  cyN0AExPO2  POaldI6Y0l  TDc13rPdT0  xgoDOW8Y1L
2000-01-01 00:00:00  589696  784856 1980-01-12 02:07:21  0.909340  ...  GIRAAVBRpj  xwcnpwFohz  wqcoTMjQ4S  GTcIWXElo7
...                     ...     ...                 ...       ...  ...         ...         ...         ...         ...
2000-01-01 00:16:39  773606  205714 1980-01-12 07:40:21  0.895944  ...  HEkXfD7pku  1ogy12wBom  OT3KmQRFGz  Dp1cK5R4Gq
2000-01-01 00:16:39  915732  730737 1980-01-06 10:25:30  0.854145  ...  Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:16:39  990722  567886 1980-01-03 05:50:06  0.676511  ...  gVO3g0I97R  yCqOhTVeEi  imCCeQa0WG  9tslOJGWDJ
2000-01-01 00:16:39  531778  438944 1980-01-04 20:07:48  0.190714  ...  rbLmkbnO5G  ATm3BpWLC0  moLkyY2Msc  7A2UJERrBG
2000-01-01 00:16:39  880791  245911 1980-01-02 15:57:36  0.014967  ...  bZuKNBvrEF  K84u9HyAmG  4yy2bsUVNn  WZQ5Vvl9zD
2000-01-01 00:16:39  239866  425516 1980-01-10 05:26:42  0.667183  ...  6xukg6TVah  VEUz4d92B8  zHDxty6U3d  ItztnI5LmJ
2000-01-01 00:16:39  338368  804695 1980-01-12 05:27:09  0.084818  ...  NM4fdjKBuW  LXGUbLIuw9  SHdpnttX6q  4oXKMsaOJ5

[1000000 rows x 15 columns]

In [60]: df.shape
Out[60]: (1000000, 15)

In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39
Data columns (total 15 columns):
i1     1000000 non-null int32
i2     1000000 non-null int32
dt2    1000000 non-null datetime64[ns]
f1     1000000 non-null float64
f2     1000000 non-null float64
s1     1000000 non-null object
s2     1000000 non-null object
s3     1000000 non-null object
s4     1000000 non-null object
s5     1000000 non-null object
s6     1000000 non-null object
s7     1000000 non-null object
s8     1000000 non-null object
s9     1000000 non-null object
s10    1000000 non-null object
dtypes: datetime64[ns](1), float64(2), int32(2), object(10)
memory usage: 114.4+ MB
```
Let's write it to disk in different formats (CSV, HDF5 fixed, HDF5 table, Feather):
```python
# CSV
df.to_csv('c:/tmp/test.csv')

# HDF5 table format
df.to_hdf('c:/tmp/test.h5', 'test', format='t')

# HDF5 fixed format
df.to_hdf('c:/tmp/test_fix.h5', 'test')

# Feather format
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')
```
Timing:
Now let's measure how long it takes to read the data back from disk:
```
In [54]: # CSV
    ...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)
1 loop, best of 3: 12.3 s per loop    # 3rd place

In [55]: # HDF5 fixed format
    ...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')
1 loop, best of 3: 1.85 s per loop    # 1st place

In [56]: # HDF5 table format
    ...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')
1 loop, best of 3: 24.2 s per loop    # 4th place

In [57]: # Feather
    ...: %timeit feather.read_dataframe('c:/tmp/test.feather')
1 loop, best of 3: 3.21 s per loop    # 2nd place
```
If you don't always need to read all of the data, it makes sense to store it in HDF5 table format and make use of the data_columns parameter to index the columns that will be used for filtering, as sketched below.
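For illustration, a minimal sketch of what that looks like (the indexed columns and the where condition are just examples):

```python
# Write in table format, building an on-disk index for the i1 and dt2 columns.
df.to_hdf('c:/tmp/test.h5', 'test', format='table', data_columns=['i1', 'dt2'])

# Read back only the rows matching the condition, instead of the whole file.
subset = pd.read_hdf('c:/tmp/test.h5', 'test', where='i1 > 900000')
```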
A common approach I take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas is to save the DataFrames to .pkl files for later use. They do take up more disk space (sometimes as much as 2x), but they cut my load times in Jupyter Notebook from 10-50 seconds with CSV to about 1-5 seconds with pickle.
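For reference, a minimal pickle round trip looks like this (the path is a placeholder):

```python
# Serialize the DataFrame to disk; dtypes and the index survive the round trip.
df.to_pickle('./testFile.pkl')

# Later, load it back with no re-parsing of dates or dtypes.
dfTest = pd.read_pickle('./testFile.pkl')
```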
```
In [1]: %%time
   ...: dfTest = pd.read_pickle('./testFile.pkl')
   ...: print(dfTest.shape)
Out[1]: (10820089, 23)
Wall time: 1.89 s

In [2]: %%time
   ...: dfTest = pd.read_csv('./testFile.csv')
   ...: print(dfTest.shape)
Out[2]: (10820089, 23)
Wall time: 18.9 s
```
You can see the size difference between the two test files used here.
Extra tip: after I'm done performing operations on the dataset, I usually export the DataFrame back to CSV, since it archives more compactly alongside my projects.
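As a side note, if archive size is the goal, to_csv can also write compressed files directly (a sketch, assuming a pandas version that supports the compression parameter; the filename is a placeholder):

```python
# Write a gzip-compressed CSV; read_csv infers the compression from the extension.
df.to_csv('./archive.csv.gz', compression='gzip')
dfArchived = pd.read_csv('./archive.csv.gz')
```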
Depending on what you're using the data for, you might also benefit from the chunksize argument of read_csv, which you can find in the docs; a sketch follows.
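A minimal sketch of chunked reading (the chunk size and the per-chunk work are just illustrative):

```python
# Read the CSV in 100k-row chunks instead of materializing it all at once.
row_count = 0
for chunk in pd.read_csv('./testFile.csv', chunksize=10**5):
    # Each chunk is a regular DataFrame; aggregate it, then let it go.
    row_count += len(chunk)
print(row_count)
```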
If you only need to analyze the data (so you only need to load it once) and you aren't already using an environment like Jupyter, definitely try it out! You can load the data once and keep it in memory while you try different operations.
Any other advice you might need can probably be found in this thread.