Split a large text file (around 50 GB) into multiple files
This working solution uses the `split` command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.
First, I created a test file with 1000M entries (15 GB) with:

```sh
awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt
```
Then I used `split`:

```sh
split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t
```
It took 5 min to produce a set of 34 small files named `t00` through `t33`. The first 33 files are 458 MB each, and the last one, `t33`, is 153 MB.
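A pure-Python alternative chunks the open file lazily with `itertools.chain` and `itertools.islice`, holding only the first line of each chunk in memory while the rest is streamed straight to the output file: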
```python
from itertools import chain, islice

def chunks(iterable, n):
    "chunks(ABCDE, 2) => AB CD E"
    iterable = iter(iterable)
    while True:
        # store one line in memory, then chain it to an
        # iterator over the rest of the chunk
        try:
            first = next(iterable)
        except StopIteration:
            return  # PEP 479: end the generator instead of leaking StopIteration
        yield chain([first], islice(iterable, n - 1))

l = 30 * 10**6
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
```
I would use the Unix utility `split` if it is available to you and your only task is to split the file. Here, however, is a pure-Python solution:
```python
import contextlib

file_large = 'large_file.txt'
l = 30 * 10**6  # lines per split file

with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not i % l:
            # start a new split file every l lines; ExitStack keeps every
            # opened file registered and closes them all when the block exits
            file_split = '{}.{}'.format(file_large, i // l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write(line)  # line already ends with '\n'
```
If all of your lines have four 3-digit numbers on them (so every line has the same fixed length) and you have multiple cores available, then you can exploit file seeking and run multiple processes, as sketched below.
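A minimal sketch of that idea, assuming every line is exactly 16 bytes ("123.123.123.123" plus a newline) as in the test file above; the `write_split` helper, the file name, and the 1 MiB copy buffer are illustrative choices, not part of the original answers:

```python
import os
from multiprocessing import Pool

LINE_SIZE = 16                  # bytes per line: "123.123.123.123" + "\n" (assumption)
LINES_PER_SPLIT = 30 * 10**6    # same chunk size as the examples above
FILE_LARGE = 'large_file.txt'   # hypothetical input name

def write_split(i):
    """Copy chunk i (LINES_PER_SPLIT fixed-width lines) into its own file."""
    start = i * LINES_PER_SPLIT * LINE_SIZE
    remaining = LINES_PER_SPLIT * LINE_SIZE
    with open(FILE_LARGE, 'rb') as src, \
         open('{}.{}'.format(FILE_LARGE, i), 'wb') as dst:
        src.seek(start)                              # jump straight to this chunk
        while remaining > 0:
            buf = src.read(min(remaining, 1 << 20))  # copy in 1 MiB blocks
            if not buf:                              # past end of file: last chunk is short
                break
            dst.write(buf)
            remaining -= len(buf)

if __name__ == '__main__':
    total_lines = os.path.getsize(FILE_LARGE) // LINE_SIZE
    n_splits = -(-total_lines // LINES_PER_SPLIT)    # ceiling division
    with Pool() as pool:
        pool.map(write_split, range(n_splits))
```

Each worker opens the file independently and seeks to its own byte offset, so no coordination is needed. Note that parallel reads mainly pay off on SSDs; on a spinning disk the competing seeks can make this slower than a single sequential pass.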