How can I efficiently open 30gb of file and process pieces of it without slowing down?

pandas


A couple of things come to mind.

First, if you bring the data into a pandas DataFrame, read_csv() has a chunksize argument for importing large data in pieces. It lets you process what you need and discard what you don't, while still providing information such as df.describe() for summary stats.
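For example, a minimal sketch of the chunked approach (the file path, column name, and chunk size are placeholders, not from the original question):

    import pandas as pd

    # Stream the file in manageable pieces instead of loading 30 GB at once.
    # 'data.csv', the 'value' column, and the chunk size are all placeholders.
    for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
        subset = chunk[chunk['value'] > 0]   # keep only the rows you actually need
        print(subset.describe())             # per-chunk summary stats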

Also, I hear great things about dask. It is a scalable platform for parallel, multi-core, multi-machine processing, is almost as simple to use as pandas and numpy, and requires very little resource management.
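If you try dask, the dataframe API mirrors pandas; a rough sketch (again, the path and column name are placeholders):

    import dask.dataframe as dd

    # dask splits the CSV into partitions and processes them in parallel;
    # nothing is actually read until .compute() is called.
    df = dd.read_csv('data.csv')           # placeholder path
    summary = df['value'].describe()       # lazy; builds a task graph
    print(summary.compute())               # triggers the parallel read/aggregation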


Use pandas or dask and pay attention to the options for read_csv(), mainly: chunksize, nrows, skiprows, usecols, engine (use the C engine), low_memory, memory_map.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
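For illustration, a sketch of how those options might be combined (the path and column names are placeholders; check the linked docs for the exact behavior of each option in your pandas version):

    import pandas as pd

    # Read only what you need: two columns, a capped number of rows,
    # the fast C parser, and a memory-mapped file.
    sample = pd.read_csv(
        'data.csv',                # placeholder path
        usecols=['id', 'value'],   # placeholder column names
        skiprows=range(1, 1000),   # skip rows 1..999, keep the header row
        nrows=1_000_000,           # cap how many rows are parsed
        engine='c',
        low_memory=True,
        memory_map=True,
    )

    # Or stream the whole file in pieces with chunksize:
    for chunk in pd.read_csv('data.csv', usecols=['id', 'value'],
                             engine='c', chunksize=1_000_000):
        print(chunk['value'].mean())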


The problem here is you are rereading a HUGE unindexed file multiple times line by line from the start of the file. No wonder this takes HOURS.

Each islice in your code starts from the very beginning of the file -- every time -- and reads and discards all the lines before it even reaches the start of the data of interest. That is very slow and inefficient.
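In other words, the pattern presumably looks something like this hypothetical reconstruction (not your exact code), where every call pays the full cost of skipping from line 0:

    from itertools import islice

    # Hypothetical reconstruction of the slow pattern: every slice reopens
    # (or rewinds) the file and discards every line before 'start',
    # beginning at byte 0 each time.
    def get_lines(path, start, stop):
        with open(path) as f:
            return list(islice(f, start, stop))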

The solution is to create a poor man's index to that file and then read smaller chunks for each slice.

Let's create a test file:

    from pathlib import Path

    p = Path('/tmp/file')
    with open(p, 'w') as f:
        for i in range(1024*1024*500):
            f.write(f'{i+1}\n')

    print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')

That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.

Now try reading that entire file line-by-line with the Unix utility wc to count the total lines (wc being the fastest way to count lines, in general):

    $ time wc -l /tmp/file
    524288000 file

    real    0m3.088s
    user    0m2.452s
    sys     0m0.636s

Compare that with the speed for Python3 to read the file line by line and print the total:

    $ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
    524288000

    real    0m53.880s
    user    0m52.849s
    sys     0m0.940s

Python is almost 18x slower to read the file line by line than wc.

Now do a further comparison. Check out the speed of the Unix utility tail to print the last n lines of a file:

    $ time tail -n 3 file
    524287998
    524287999
    524288000

    real    0m0.007s
    user    0m0.003s
    sys     0m0.004s

The tail utility is 445x faster than wc (and roughly 8,000x faster than Python) at getting to the last three lines of the file because it uses a windowed buffer: i.e., tail seeks near the end of the file, reads some number of bytes, and then extracts the last n lines from the buffer it read.

It is possible to apply the same tail-style approach in your application.
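For instance, a rough Python sketch of tail's trick (the function name and window size are mine; it assumes the last n lines fit in one window):

    import os

    def tail_lines(path, n, window=64 * 1024):
        # Seek close to the end, read one window of bytes, and pull the last
        # n lines out of that buffer -- no need to touch the first ~5 GB at all.
        with open(path, 'rb') as f:
            size = f.seek(0, os.SEEK_END)       # file size in bytes
            f.seek(max(0, size - window))       # jump near the end
            return f.read().splitlines()[-n:]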

Consider this photo:

[photo: a rack of 1950s-era IBM magnetic data tapes]

The approach you are using is the equivalent of reading every tape on that rack to find data that is only on two of the middle tapes -- and doing it over and over and over...

In the 1950's (the era of the photo) each tape was roughly indexed for what it held. Computers would call for a specific tape in the rack -- not ALL the tapes in the rack.

The solution to your issue (in outline) is to build a tape-like indexing scheme:

  1. Run through the 30 GB file ONCE and create an index of sub-blocks by starting line number for the block. Think of each sub-block as roughly one tape (except it runs easily all the way to the end...)
  2. Instead of seeking back to offset 0 before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice with its offsets adjusted relative to that block's starting line number (see the sketch at the end of this answer).
  3. You have a MASSIVE advantage compared with what they had to do in the 50's and 60's: you only need to calculate the starting block, since you have access to the entire remaining file. A 1950s tape index would call for tapes x, y, z, ... to read data larger than one tape could hold; you only need to find x, the tape that contains the starting line number of interest.
  4. BTW, since each IBM tape of this type held roughly 3 MB, your 30 GB file would fill more than 10,000 of these tapes...

Correctly implemented (and this is not terribly hard to do) it would speed up the read performance by 100x or more.

Constructing a useful index to a text file by line offset might be as easy as something like this:

    def index_file(p, delimiter=b'\n', block_size=1024*1024):
        index = {0: 0}
        total_lines, cnt = (0, 0)
        with open(p, 'rb') as f:
            while buf := f.raw.read(block_size):
                cnt = buf.count(delimiter)
                idx = buf.rfind(delimiter)
                key = cnt + total_lines
                index[key] = f.tell() - (len(buf) - idx) + len(delimiter)
                total_lines += cnt
        return index

    # this index is created in about 4.9 seconds on my computer...
    # with a 4.8 GB file, there are ~4,800 index entries

That constructs an index correlating the starting line number of each block with its byte offset from the beginning of the file:

    >>> idx = index_file(p)
    >>> idx
    {0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
    ...
    524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}

Then if you want access to lines[524179347:524179500] you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.
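Putting it together, here is a hedged sketch of a reader built on that index (the read_slice name is mine, not from the original code; it assumes the index returned by index_file() above):

    from itertools import islice

    def read_slice(path, index, start, stop):
        # Find the largest indexed line number that is <= start, seek straight
        # to where that line begins, then islice only the remaining distance
        # instead of walking the whole file from byte 0.
        base = max(k for k in index if k <= start)
        with open(path, 'rb') as f:
            f.seek(index[base])
            # lines come back as bytes since the file is opened in binary mode
            return list(islice(f, start - base, stop - base))

    # e.g. lines[524179347:524179500] now costs one seek plus ~150 short reads:
    # read_slice(p, idx, 524179347, 524179500)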