Efficient way to read 15 M lines csv files in python
First, let's answer the title of the question
1- How to efficiently read 15M lines of a csv containing floats
I suggest you use modin:
Generating sample data:
import modin.pandas as mpd
import pandas as pd
import numpy as np

frame_data = np.random.randint(0, 10_000_000, size=(15_000_000, 2))
pd.DataFrame(frame_data*0.0001).to_csv('15mil.csv', header=False)

!wc 15mil*.csv ; du -h 15mil*.csv
15000000 15000000 480696661 15mil.csv
459M 15mil.csv
Now to the benchmarks:
%%timeit -r 3 -n 1 -t
global df1
df1 = pd.read_csv('15mil.csv', header=None)
9.7 s ± 95.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

%%timeit -r 3 -n 1 -t
global df2
df2 = mpd.read_csv('15mil.csv', header=None)
3.07 s ± 685 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

(df2.values == df1.values).all()
True
So, as we can see, modin was approximately 3 times faster on my setup.
Now to answer your specific problem
2- Cleaning a csv file that contains non numeric characters, and then reading it
As people have noted, your bottleneck is probably the converter. You are calling those lambdas 30 million times (two columns times 15 million rows). Even the function call overhead becomes non-trivial at that scale.
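To get a feel for that overhead, here is a small standard-library sketch. It is not the code from the question: the sample data and the names clean_per_field and clean_in_bulk are made up for illustration. It cleans the same fields once with one regex call per field and once with a single call over the joined text. Timings will vary by machine.

```python
import re
import timeit

# Hypothetical sample: 200,000 dirty fields such as "12.3)4".
fields = [f"{i % 1000}.{i % 7})4" for i in range(200_000)]

# Keep digits, dots and newlines; the newline is only used as a separator below.
NON_NUMERIC = re.compile(r"[^\d.\n]")

def clean_per_field(values):
    """One regex call per field, similar to what a pandas converter does."""
    return [NON_NUMERIC.sub("", v) for v in values]

def clean_in_bulk(values):
    """Join the fields, run the regex once, then split again."""
    return NON_NUMERIC.sub("", "\n".join(values)).split("\n")

# Both strategies must agree before their speed is compared.
assert clean_per_field(fields) == clean_in_bulk(fields)

per_field = timeit.timeit(lambda: clean_per_field(fields), number=3)
in_bulk = timeit.timeit(lambda: clean_in_bulk(fields), number=3)
print(f"per field: {per_field:.2f} s, in bulk: {in_bulk:.2f} s")
```

The bulk variant does the same character filtering but pays the Python-level call cost once instead of once per field.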
Let's attack this problem.
Generating dirty dataset:
!sed 's/.\{4\}/&)/g' 15mil.csv > 15mil_dirty.csv
Approaches
First, I tried using modin with the converters argument. Then, I tried a different approach that calls the regexp fewer times:
First I will create a File-like object that filters everything through your regexp:
import re

class FilterFile:
    def __init__(self, file):
        self.file = file

    def read(self, n):
        # Strip everything except digits, dots, commas and newlines
        return re.sub(r"[^\d.,\n]", "", self.file.read(n))

    def write(self, *a):
        return self.file.write(*a)  # needed to trick pandas

    def __iter__(self, *a):
        return self.file.__iter__(*a)  # needed
Then we pass it to pandas as the first argument in read_csv:
with open('15mil_dirty.csv') as file:
    df2 = pd.read_csv(FilterFile(file), header=None)
Benchmarks:
%%timeit -r 1 -n 1 -t
global df1
df1 = pd.read_csv('15mil_dirty.csv', header=None,
                  converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                              1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
                  )
2min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -r 1 -n 1 -t
global df2
df2 = mpd.read_csv('15mil_dirty.csv', header=None,
                   converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                               1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
                   )
38.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -r 1 -n 1 -t
global df3
df3 = pd.read_csv(FilterFile(open('15mil_dirty.csv')), header=None,)
1min ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Seems like modin wins again! Unfortunately, modin has not implemented reading from buffers yet, so I devised the ultimate approach.
The Ultimate Approach:
%%timeit -r 1 -n 1 -t
with open('15mil_dirty.csv') as f, open('/dev/shm/tmp_file', 'w') as tmp:
    tmp.write(f.read().translate({ord(i): None for i in '()'}))
df4 = mpd.read_csv('/dev/shm/tmp_file', header=None)
5.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
This uses translate, which is considerably faster than re.sub, and also uses /dev/shm, which is an in-memory filesystem that Ubuntu (and other Linux distributions) usually provide. Files written there are held in RAM instead of going to disk, so it is fast. Finally, it uses modin to read the file, working around modin's buffer limitation. This approach is roughly 26 times faster than your approach (5.68 s versus 2 min 28 s), and it is pretty simple, too.
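If you want to check the translate versus re.sub difference on your own machine, a minimal standard-library sketch could look like this. The sample text and the function names are invented for the example, and the numbers it prints will depend on your setup.

```python
import re
import timeit

# Hypothetical sample resembling the dirty file: a stray ")" inside the numbers.
text = "0,123.4)567,89.0)12\n" * 100_000

# str.translate takes a table mapping code points to replacements;
# mapping a code point to None deletes that character.
DROP_PARENS = {ord(c): None for c in "()"}

def clean_with_translate(s):
    return s.translate(DROP_PARENS)

def clean_with_regex(s):
    return re.sub(r"[()]", "", s)

# Same output from both, so only the speed differs.
assert clean_with_translate(text) == clean_with_regex(text)

t_translate = timeit.timeit(lambda: clean_with_translate(text), number=5)
t_regex = timeit.timeit(lambda: clean_with_regex(text), number=5)
print(f"translate: {t_translate:.3f} s, re.sub: {t_regex:.3f} s")
```

One trade-off to keep in mind: translate removes only the characters you list, whereas the regexp from the question removes everything that is not a digit or a dot. It fits here because the dirty file only contains stray parentheses.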
Well, my findings are not much related to pandas, but rather to some common pitfalls.
Your code:
(genel_deneme) ➜ derp time python a.py
python a.py 38.62s user 0.69s system 100% cpu 39.008 total
- Precompile your regex
Replace re.sub(r"[^\d.]", "", x) with a precompiled version and use it in your lambdas. Result:
(genel_deneme) ➜ derp time python a.py
python a.py 26.42s user 0.69s system 100% cpu 26.843 total
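A minimal sketch of what precompiling could look like. The names NON_NUMERIC and to_float are mine, not from the question, and I use the built-in float here rather than np.float32.

```python
import re

# Compiled once at import time instead of being looked up on every call.
NON_NUMERIC = re.compile(r"[^\d.]")

def to_float(field):
    """Strip everything except digits and dots, then parse the result."""
    return float(NON_NUMERIC.sub("", field))

print(to_float("12.3)4"))  # prints 12.34
```

The re module already caches compiled patterns internally, so the saving comes mostly from skipping that cache lookup on every call. Passing the function directly, for example converters={0: to_float, 1: to_float}, also avoids one extra lambda call per field.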
- Try to find a better way than directly using np.float32, since it's 6-10 times slower than I think you expect it to be. The following is not what you want, but I just want to show the issue here.
Replace np.float32 with float and run your code. My result:
(genel_deneme) ➜ derp time python a.py
python a.py 14.79s user 0.60s system 102% cpu 15.066 total
Find another way to achieve the result with the floats. More on this issue https://stackoverflow.com/a/6053175/37491
- Split your file and distribute the work across subprocesses if you can. You already work on separate chunks of constant size, so you can divide the file and handle each part in a separate process using multiprocessing, or in threads.
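As a rough sketch of that last point, assuming a comma-separated file with one record per line: the helpers chunk_offsets, clean_chunk and clean_parallel are hypothetical names of mine, and this uses only the standard library rather than pandas. The file is cut into byte ranges that end on line boundaries, and each range is cleaned and parsed in its own process.

```python
import os
import re
from concurrent.futures import ProcessPoolExecutor

# Bytes pattern: keep ASCII digits, dots, commas and newlines.
NON_NUMERIC = re.compile(rb"[^\d.,\n]")

def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # move forward to the start of the next line
            cuts.append(min(f.tell(), size))
    cuts.append(size)
    cuts = sorted(set(cuts))  # drop duplicate cut points on small files
    return list(zip(cuts[:-1], cuts[1:]))

def clean_chunk(job):
    """Read one byte range, strip unwanted characters, parse rows of floats."""
    path, start, end = job
    with open(path, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    text = NON_NUMERIC.sub(b"", raw).decode("ascii")
    return [[float(x) for x in line.split(",")] for line in text.splitlines() if line]

def clean_parallel(path, workers=4):
    """Clean every chunk in a separate process and concatenate the rows in order."""
    jobs = [(path, start, end) for start, end in chunk_offsets(path, workers)]
    rows = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(clean_chunk, jobs):
            rows.extend(part)
    return rows
```

Call clean_parallel('15mil_dirty.csv') from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Sending millions of parsed rows back to the parent process has a pickling cost of its own, so measure before assuming this beats a single process.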