
Pandas read_csv speed up


Let's test it!

Data generation:

import numpy as np
import pandas as pd

sz = 10**3
df = pd.DataFrame(np.random.randint(0, 10**6, (sz, 2)), columns=['i1', 'i2'])
df['date'] = pd.date_range('2000-01-01', freq='1S', periods=len(df))
df['dt2'] = pd.date_range('1980-01-01', freq='999S', periods=len(df))
df['f1'] = np.random.rand(len(df))
df['f2'] = np.random.rand(len(df))
# generate 10 random string columns
for i in range(1, 11):
    df['s{}'.format(i)] = pd.util.testing.rands_array(10, len(df))
# blow the frame up to 1 million rows, shuffle it and set a sorted datetime index
df = pd.concat([df] * 10**3, ignore_index=True).sample(frac=1)
df = df.set_index(df.pop('date').sort_values())

We have generated the following DataFrame:

In [59]: df
Out[59]:
                         i1      i2                 dt2        f1     ...              s7          s8          s9         s10
date                                                                  ...
2000-01-01 00:00:00  216625    4179 1980-01-04 04:35:24  0.679989     ...      7G8rLnoocA  E7Ot7oPsJ6  puQamLn0I2  zxHrATQn0m
2000-01-01 00:00:00  374740  967991 1980-01-09 11:07:48  0.202064     ...      wLETO2g8uL  MhtzNLPXCH  PW1uKxY0df  wTakdCe6nK
2000-01-01 00:00:00  152181  627451 1980-01-10 11:49:39  0.956117     ...      mXOsfUPqOy  6IIst7UFDT  nL6XZxrT3r  BxPCFNdZTK
2000-01-01 00:00:00  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:00:00  590262  248378 1980-01-06 11:48:45  0.307373     ...      wRnMPxeopd  JF24uTUwJC  2CRrs9yB2N  hxYrXFnT1H
2000-01-01 00:00:00  161183  620876 1980-01-08 21:48:36  0.207536     ...      cyN0AExPO2  POaldI6Y0l  TDc13rPdT0  xgoDOW8Y1L
2000-01-01 00:00:00  589696  784856 1980-01-12 02:07:21  0.909340     ...      GIRAAVBRpj  xwcnpwFohz  wqcoTMjQ4S  GTcIWXElo7
...                     ...     ...                 ...       ...     ...             ...         ...         ...         ...
2000-01-01 00:16:39  773606  205714 1980-01-12 07:40:21  0.895944     ...      HEkXfD7pku  1ogy12wBom  OT3KmQRFGz  Dp1cK5R4Gq
2000-01-01 00:16:39  915732  730737 1980-01-06 10:25:30  0.854145     ...      Crh94m085p  M1tbrorxGT  XWSKk3b8Pv  M9FWQtPzaa
2000-01-01 00:16:39  990722  567886 1980-01-03 05:50:06  0.676511     ...      gVO3g0I97R  yCqOhTVeEi  imCCeQa0WG  9tslOJGWDJ
2000-01-01 00:16:39  531778  438944 1980-01-04 20:07:48  0.190714     ...      rbLmkbnO5G  ATm3BpWLC0  moLkyY2Msc  7A2UJERrBG
2000-01-01 00:16:39  880791  245911 1980-01-02 15:57:36  0.014967     ...      bZuKNBvrEF  K84u9HyAmG  4yy2bsUVNn  WZQ5Vvl9zD
2000-01-01 00:16:39  239866  425516 1980-01-10 05:26:42  0.667183     ...      6xukg6TVah  VEUz4d92B8  zHDxty6U3d  ItztnI5LmJ
2000-01-01 00:16:39  338368  804695 1980-01-12 05:27:09  0.084818     ...      NM4fdjKBuW  LXGUbLIuw9  SHdpnttX6q  4oXKMsaOJ5

[1000000 rows x 15 columns]

In [60]: df.shape
Out[60]: (1000000, 15)

In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000000 entries, 2000-01-01 00:00:00 to 2000-01-01 00:16:39
Data columns (total 15 columns):
i1     1000000 non-null int32
i2     1000000 non-null int32
dt2    1000000 non-null datetime64[ns]
f1     1000000 non-null float64
f2     1000000 non-null float64
s1     1000000 non-null object
s2     1000000 non-null object
s3     1000000 non-null object
s4     1000000 non-null object
s5     1000000 non-null object
s6     1000000 non-null object
s7     1000000 non-null object
s8     1000000 non-null object
s9     1000000 non-null object
s10    1000000 non-null object
dtypes: datetime64[ns](1), float64(2), int32(2), object(10)
memory usage: 114.4+ MB

Let's write it to disk in different formats (CSV, HDF5 fixed, HDF5 table, Feather):

# CSV
df.to_csv('c:/tmp/test.csv')

# HDF5 table format
df.to_hdf('c:/tmp/test.h5', 'test', format='t')

# HDF5 fixed format
df.to_hdf('c:/tmp/test_fix.h5', 'test')

# Feather format
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')

Timing:

Now we can measure reading from disk:

In [54]: # CSV
    ...: %timeit pd.read_csv('c:/tmp/test.csv', parse_dates=['date', 'dt2'], index_col=0)
1 loop, best of 3: 12.3 s per loop   # 3rd place

In [55]: # HDF5 fixed format
    ...: %timeit pd.read_hdf('c:/tmp/test_fix.h5', 'test')
1 loop, best of 3: 1.85 s per loop   # 1st place

In [56]: # HDF5 table format
    ...: %timeit pd.read_hdf('c:/tmp/test.h5', 'test')
1 loop, best of 3: 24.2 s per loop   # 4th place

In [57]: # Feather
    ...: %timeit feather.read_dataframe('c:/tmp/test.feather')
1 loop, best of 3: 3.21 s per loop   # 2nd place

If you don't always need to read all the data, it makes sense to store it in HDF5 table format (and use the data_columns parameter to index the columns that will be used for filtering).
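For illustration, a minimal sketch of that approach (the c:/tmp/test_dc.h5 file name and the where filter are made up for this example; the column names come from the test DataFrame above):

# Store in table format and declare 'i1' and 'dt2' as indexed data columns.
df.to_hdf('c:/tmp/test_dc.h5', 'test', format='table', data_columns=['i1', 'dt2'])

# Later, read back only the rows that satisfy the filter instead of the whole file.
subset = pd.read_hdf('c:/tmp/test_dc.h5', 'test', where='i1 > 900000')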


A common approach I take when handling large datasets (~4-10 million rows, 15-30 columns) with pandas is to save the DataFrames to .pkl files for later use. They do take up more disk space (sometimes as much as 2x), but they cut my load times in Jupyter Notebook from 10-50 seconds with CSV to about 1-5 seconds with pickle.

In [1]: %%time
        dfTest = pd.read_pickle('./testFile.pkl')
        print(dfTest.shape)
Out[1]: (10820089, 23)
        Wall time: 1.89 s

In [2]: %%time
        dfTest = pd.read_csv('./testFile.csv')
        print(dfTest.shape)
Out[2]: (10820089, 23)
        Wall time: 18.9 s

You can see the difference in size between the two test files used in this comparison here.

Extra tip: after I'm done performing operations on the dataset, I usually write the DataFrame back out to CSV, which gives smaller files for archiving my projects. A minimal sketch of the whole round trip follows below.
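Here is that workflow as a sketch (the DataFrame and file names are just placeholders for this example):

import numpy as np
import pandas as pd

# A small stand-in DataFrame; in practice this would be the large dataset.
df = pd.DataFrame(np.random.rand(1000, 3), columns=['a', 'b', 'c'])

# Cache it as a pickle: a larger file on disk, but much faster to reload later.
df.to_pickle('./testFile.pkl')
df = pd.read_pickle('./testFile.pkl')

# Once the work is done, write it back out to CSV for a smaller archive file.
df.to_csv('./testFile_archive.csv', index=False)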


Depending on what you're using the data for, you might benefit from the chunksize argument of read_csv, which you can find in the docs. It lets you process the file in pieces instead of loading everything at once; see the sketch below.
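A minimal sketch of chunked reading, assuming all you need is a running aggregate (the file path and the f1 column reuse the test data from the first answer):

import pandas as pd

total = 0.0
# read_csv with chunksize returns an iterator of DataFrames, 100,000 rows each.
for chunk in pd.read_csv('c:/tmp/test.csv', chunksize=100_000):
    total += chunk['f1'].sum()   # aggregate the chunk, then let it be freed
print(total)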

If you only need to do analysis on the data (i.e. you only need to load it once) and you aren't already using an environment like Jupyter, definitely try it out! You can load the data once and keep it in memory while you try out different operations.

Any other advice you might need, I think you'll find in this thread.