
shuffle a large list of items without loading in memory


If you can reserve 16 GB of memory, I wrote a program called sample that shuffles the lines of a file by reading in their byte offsets, shuffling the offsets, and then printing output by seeking through the file to the shuffled offsets. It uses 8 bytes for each 64-bit offset, thus 16 GB for a two-billion-line input.

It won't be fast, but on a system with enough memory, sample will shuffle files that are large enough to cause GNU shuf to fail. Further, it uses mmap routines to try to minimize the I/O expense of a second pass through your file. It also has a few other options; see --help for more details.
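For reference, here is a minimal Python sketch of that offsets-then-seek idea. It is not sample's actual code; the plain-list offset storage and the file-handling details are just for illustration.

    import random

    def shuffle_by_offsets(in_path, out_path):
        # First pass: record the byte offset where each line starts.
        offsets = []
        with open(in_path, 'rb') as f:
            pos = 0
            for line in f:
                offsets.append(pos)
                pos += len(line)

        # Shuffle the offsets instead of the lines themselves.
        random.shuffle(offsets)

        # Second pass: seek to each shuffled offset and copy that line out.
        with open(in_path, 'rb') as f, open(out_path, 'wb') as out:
            for off in offsets:
                f.seek(off)
                out.write(f.readline())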

By default, this program will sample without replacement and shuffle by single lines. If you want to shuffle with replacement, or if your input is in FASTA, FASTQ or another multi-line format, you can add some options to adjust how sampling is done. (Or you can apply an alternative approach, which I link to in a Perl gist below, but sample addresses these cases.)

If your FASTA sequences occupy two lines each, that is, they alternate between a sequence header on one line and sequence data on the next, you can still shuffle with sample, and with half the memory, since you are only shuffling half the number of offsets. See the --lines-per-offset option; you'd specify 2, for instance, to shuffle pairs of lines.

FASTQ records, in turn, span four lines each. You can specify --lines-per-offset=4 to shuffle a FASTQ file with a quarter of the memory required to shuffle a single-line file.
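A similar sketch, generalized to one offset per record in the spirit of the --lines-per-offset option described above (again only an illustration, not the tool's code):

    import random

    def shuffle_records(in_path, out_path, lines_per_offset=4):
        # One offset per record: 2 for FASTA header/sequence pairs,
        # 4 for FASTQ records.
        offsets = []
        with open(in_path, 'rb') as f:
            pos = 0
            for line_no, line in enumerate(f):
                if line_no % lines_per_offset == 0:
                    offsets.append(pos)
                pos += len(line)

        random.shuffle(offsets)

        # Seek to each record start and copy its lines to the output.
        with open(in_path, 'rb') as f, open(out_path, 'wb') as out:
            for off in offsets:
                f.seek(off)
                for _ in range(lines_per_offset):
                    out.write(f.readline())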

Alternatively, I have a gist here written in Perl, which samples sequences without replacement from a FASTA file without regard for the number of lines in a sequence. Note that this isn't exactly the same as shuffling a whole file, but you could use it as a starting point, since it collects the offsets. Instead of sampling some of the offsets, you'd remove line 47, which sorts the shuffled indices, and then use file seek operations to read through the file, using the shuffled-index list directly.

Again, it won't be fast, because you are jumping through a very large file out of order, but storing offsets is much less expensive than storing whole lines, and adding mmap routines could help a little with what is essentially a series of random access operations. And if you are working with FASTA, you'll have even fewer offsets to store, so your memory usage (excluding any relatively insignificant container and program overhead) should be at most 8 GB, and likely less, depending on its structure.


How about:

    import mmap
    from random import shuffle


    def find_lines(data):
        # Yield the offset of every newline in the memory-mapped file.
        for i, char in enumerate(data):
            if char == ord('\n'):  # mmap yields ints in Python 3
                yield i


    def shuffle_file(in_file, out_file):
        with open(in_file, 'rb') as f:
            data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # Collect (start, end) byte offsets for every line.
            start = 0
            lines = []
            for end in find_lines(data):
                lines.append((start, end))
                start = end + 1
            shuffle(lines)
            # Write the lines back out in shuffled order, newlines included.
            with open(out_file, 'wb') as out:
                for start, end in lines:
                    out.write(data[start:end + 1])


    if __name__ == "__main__":
        shuffle_file('data', 'result')

This solution only ever stores the file offsets of the lines in the file; that's 2 words per line, plus container overhead.


You may check out my HugeFileProcessor tool. It's similar to @Alex-Reynolds's sample, but should be significantly faster, as there would be no seeks.

Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to the output. The bigger it is the better (unless you run out of RAM), because the total shuffling time would be (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.

The algorithm is as follows.

  1. Count lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) This also gives a measurement of how much time it would take to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.

  2. As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.

  3. Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task (see the sketch after this list).

    • We go through the sourceFile, reading all lines but storing in memory only those lines whose indices fall in the first batchSize entries of the orderArray. When we have all these lines, we write them into outFile in the required order, and that's batchSize/linesCount of the work done.
    • Next we repeat the whole process again and again, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
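To make the flow concrete, here is a rough Python sketch of that multi-pass algorithm. It is only an illustration of the steps above, not the tool's code, and the helper names (order, wanted, kept) are my own.

    import random

    def batch_shuffle(source_path, out_path, batch_size):
        # Step 1: count lines with one sequential read of the source file.
        with open(source_path, 'rb') as f:
            lines_count = sum(1 for _ in f)

        # Step 2: a global shuffled order over all line indices
        # (random.shuffle is a Fisher-Yates shuffle).
        order = list(range(lines_count))
        random.shuffle(order)

        # Step 3: process the order array in slices of batch_size; each
        # slice costs one more sequential read of the source file.
        with open(out_path, 'wb') as out:
            for batch_start in range(0, lines_count, batch_size):
                batch = order[batch_start:batch_start + batch_size]
                # Map source line index -> position within this batch.
                wanted = {idx: pos for pos, idx in enumerate(batch)}

                # Sequential pass: keep in RAM only the lines this batch needs.
                kept = [None] * len(batch)
                with open(source_path, 'rb') as f:
                    for idx, line in enumerate(f):
                        pos = wanted.get(idx)
                        if pos is not None:
                            kept[pos] = line

                # Write this batch's lines in the globally shuffled order.
                out.writelines(kept)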

Why does it work?

Because all we do is read the source file from start to end. No seeks forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.

Some numbers

On my machine (Core i5, 16 GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84 000 000 lines) in around 5 hours using a batchSize of 3 500 000. With a batchSize of 2 000 000 it took around 8 hours. Reading speed was around 118 000 lines per second.