
Split large text file (around 50GB) into multiple files


This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.

First, I created a test file with 1000M lines (15 GB):

awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt

Then I used split:

split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t

It took 5 minutes to produce a set of 34 small files named t00-t33: the first 33 are 458 MB each, and the last one, t33, is 153 MB (1000M lines split 30M at a time gives 33 full files plus a 10M-line remainder).


from itertools import chain, islice

def chunks(iterable, n):
    "chunks(ABCDE, 2) => AB CD E"
    iterable = iter(iterable)
    while True:
        # Store one line in memory, then chain it to an iterator
        # over the rest of the chunk, so a chunk is never fully
        # materialized in memory.
        try:
            first = next(iterable)
        except StopIteration:
            return  # input exhausted (avoids a RuntimeError under PEP 479)
        yield chain([first], islice(iterable, n - 1))

l = 30 * 10**6  # lines per split file
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)


I would use the Unix utility split if it is available to you and your only task is to split the file. Here, however, is a pure Python solution:

import contextlib

file_large = 'large_file.txt'
l = 30 * 10**6  # lines per split file

with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # Start a new output file every l lines; all opened files
            # are closed together when the ExitStack exits.
            file_split = '{}.{}'.format(file_large, i // l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write(line)  # line already ends with '\n'

If all of your lines contain four 3-digit numbers (so every line has the same length) and you have multiple cores available, then you can exploit file seeking and run multiple processes.
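
This works because fixed-length records make byte offsets predictable: with 16-byte lines such as "123.123.123.123\n", line k starts at byte 16*k, so each worker can seek straight to its slice without scanning the file. Here is a minimal sketch using multiprocessing; the file name, the 16-byte record length, and the 30M-lines-per-split figure are assumptions carried over from the examples above, not part of the original answer:

import os
from multiprocessing import Pool

FILE_LARGE = 'large_file.txt'   # assumed input name
RECORD = 16                     # assumes every line is exactly 16 bytes, e.g. '123.123.123.123\n'
LINES_PER_SPLIT = 30 * 10**6    # assumed chunk size, as in the examples above

def write_split(job):
    index, start_line, n_lines = job
    remaining = n_lines * RECORD
    with open(FILE_LARGE, 'rb') as fd_in, \
         open('{}.{}'.format(FILE_LARGE, index), 'wb') as fd_out:
        # Fixed-length records let each worker jump directly to its slice.
        fd_in.seek(start_line * RECORD)
        while remaining > 0:
            block = fd_in.read(min(remaining, 8 * 1024 * 1024))
            if not block:
                break
            fd_out.write(block)
            remaining -= len(block)

if __name__ == '__main__':
    total_lines = os.path.getsize(FILE_LARGE) // RECORD
    jobs = []
    start, index = 0, 0
    while start < total_lines:
        n = min(LINES_PER_SPLIT, total_lines - start)
        jobs.append((index, start, n))
        start += n
        index += 1
    with Pool() as pool:        # one worker per core by default
        pool.map(write_split, jobs)

Whether this beats a single sequential pass depends on the storage: on a single spinning disk the concurrent seeks can easily cost more than they save, while on an SSD or a striped array the parallelism may pay off.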