What is the fastest way to output large DataFrame into a CSV file?

python performance pandas output

Lev. Pandas has rewritten to_csv to make a big improvement in native speed. The process is now i/o bound, accounts for many subtle dtype issues, and quote cases. Here is our performance results vs. 0.10.1 (in the upcoming 0.11) release. These are in ms, lower ratio is better.

Results:                                            t_head  t_baseline      rationame                                                                     frame_to_csv2 (100k) rows                 190.5260   2244.4260     0.0849write_csv_standard  (10k rows)             38.1940    234.2570     0.1630frame_to_csv_mixed  (10k rows, mixed)     369.0670   1123.0412     0.3286frame_to_csv (3k rows, wide)              112.2720    226.7549     0.4951

So Throughput for a single dtype (e.g. floats), not too wide is about 20M rows / min, here is your example from above.

In [12]: df = pd.DataFrame({'A' : np.array(np.arange(45000000),dtype='float64')}) In [13]: df['B'] = df['A'] + 1.0   In [14]: df['C'] = df['A'] + 2.0In [15]: df['D'] = df['A'] + 2.0In [16]: %timeit -n 1 -r 1 df.to_csv('test.csv')1 loops, best of 1: 119 s per loop

python performance pandas output

In 2019 for cases like this, it may be better to just use numpy. Look at the timings:

aa.to_csv('pandas_to_csv', index=False)# 6.47 sdf2csv(aa,'code_from_question', myformats=['%d','%.1f','%.1f','%.1f'])# 4.59 sfrom numpy import savetxtsavetxt(    'numpy_savetxt', aa.values, fmt='%d,%.1f,%.1f,%.1f',    header=','.join(aa.columns), comments='')# 3.5 s

So you can cut the time by a factor of two using numpy. This, of course, comes at a cost of reduced flexibility (when compared to aa.to_csv).

Benchmarked with Python 3.7, pandas 0.23.4, numpy 1.15.2 (xrange was replaced by range to make the posted function from the question work in Python 3).

PS. If you need to include the index, savetxt will work fine - just pass df.reset_index().values and adjust the formatting string accordingly.

2021 update: as pointed in the comments the pandas performance improved greatly. savetxt is still the fastest option, but only by a narrow margin: when benchmarked with pandas 1.3.0 and numpy 1.20.3, aa.to_csv() took 2.64 s while savetxt 2.53 s. The code from the question (df2csv) took 2.98 s making it the slowest option nowadays.

Your mileage may vary - the 2021 test was performed on SSD with a very fast CPU, while in 2019 I was using HDD and a slower CPU.

python performance pandas output

use chunksize. I have found that makes a hell lot of difference. If you have memory in hand use good chunksize (no of rows) to get into memory and then write once.

CodeHunter

What is the fastest way to output large DataFrame into a CSV file?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last