Split a large text file (around 50 GB) into multiple files
This working solution uses the `split` command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.
First, I created a test file with 1000M entries (15 GB) with:

```sh
awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt
```
Then I used `split`:

```sh
split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t
```
It took 5 min to produce a set of 34 small files named `t00` through `t33`. The first 33 files are 458 MB each, and the last one, `t33`, is 153 MB.
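A pure-Python alternative chunks the open file lazily with `itertools.chain` and `itertools.islice`, holding only the first line of each chunk in memory while the rest is streamed straight to the output file: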
```python
from itertools import chain, islice

def chunks(iterable, n):
    "chunks(ABCDE, 2) => AB CD E"
    iterable = iter(iterable)
    while True:
        # store one line in memory, then chain it to an
        # iterator over the rest of the chunk
        try:
            first = next(iterable)
        except StopIteration:
            return  # PEP 479: end the generator instead of leaking StopIteration
        yield chain([first], islice(iterable, n - 1))

l = 30 * 10**6
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
```
I would use the Unix utility `split` if it is available to you and your only task is to split the file. Here, however, is a pure-Python solution:
```python
import contextlib

file_large = 'large_file.txt'
l = 30 * 10**6  # lines per split file

with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # start a new split file every l lines; ExitStack keeps every
            # opened file registered and closes them all when the block exits
            file_split = '{}.{}'.format(file_large, i // l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write(line)  # line already ends with '\n'
```
If all of your lines have four 3-digit numbers on them (so every line has the same fixed length) and you have multiple cores available, then you can exploit file seeking and run multiple processes, as sketched below.
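A minimal sketch of that idea, assuming every line is exactly 16 bytes ("123.123.123.123" plus a newline) as in the test file above; the `write_split` helper, the file name, and the 1 MiB copy buffer are illustrative choices, not part of the original answers:

```python
import os
from multiprocessing import Pool

LINE_SIZE = 16                  # bytes per line: "123.123.123.123" + "\n" (assumption)
LINES_PER_SPLIT = 30 * 10**6    # same chunk size as the examples above
FILE_LARGE = 'large_file.txt'   # hypothetical input name

def write_split(i):
    """Copy chunk i (LINES_PER_SPLIT fixed-width lines) into its own file."""
    start = i * LINES_PER_SPLIT * LINE_SIZE
    remaining = LINES_PER_SPLIT * LINE_SIZE
    with open(FILE_LARGE, 'rb') as src, \
         open('{}.{}'.format(FILE_LARGE, i), 'wb') as dst:
        src.seek(start)                              # jump straight to this chunk
        while remaining > 0:
            buf = src.read(min(remaining, 1 << 20))  # copy in 1 MiB blocks
            if not buf:                              # past end of file: last chunk is short
                break
            dst.write(buf)
            remaining -= len(buf)

if __name__ == '__main__':
    total_lines = os.path.getsize(FILE_LARGE) // LINE_SIZE
    n_splits = -(-total_lines // LINES_PER_SPLIT)    # ceiling division
    with Pool() as pool:
        pool.map(write_split, range(n_splits))
```

Each worker opens the file independently and seeks to its own byte offset, so no coordination is needed. Note that parallel reads mainly pay off on SSDs; on a spinning disk the competing seeks can make this slower than a single sequential pass.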