Splitting large text file into smaller text files by line numbers using Python


    lines_per_file = 300
    smallfile = None
    with open('really_big_file.txt') as bigfile:
        for lineno, line in enumerate(bigfile):
            # Start a new output file every lines_per_file lines.
            if lineno % lines_per_file == 0:
                if smallfile:
                    smallfile.close()
                # Name each file after the last line number it can hold.
                small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
                smallfile = open(small_filename, "w")
            smallfile.write(line)
        # Close the last output file, if any was opened.
        if smallfile:
            smallfile.close()


Using the itertools grouper recipe:

    from itertools import zip_longest

    def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)

    n = 300

    with open('really_big_file.txt') as f:
        for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
            with open('small_file_{0}'.format(i * n), 'w') as fout:
                fout.writelines(g)

The advantage of this method, as opposed to storing every line in a list, is that it works on the file as an iterable, so only one group of n lines is held in memory at a time rather than the whole file.
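To see what grouper hands back for the final, partial group, here is a minimal sketch on a small in-memory list. The seven sample lines and the group size of 3 are made up for illustration; they are not part of the question.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Seven made-up lines, grouped three at a time.
lines = ["line {}\n".format(k) for k in range(1, 8)]
groups = list(grouper(3, lines, fillvalue=''))

print(len(groups))   # 3
print(groups[-1])    # ('line 7\n', '', '')
```

The last tuple is padded with empty strings, which writelines() turns into nothing, so the final file is simply shorter than the others.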

Note that with n = 300 and a 100,000-line input, the last file written by the grouper code will be named small_file_100200 even though it only goes up to line 100000. This happens because fillvalue='' pads the final group with empty strings, so nothing is written for the missing lines while the name is still computed as i * n. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it up front as I have. Here's how that can be done.

    import os, tempfile

    with open('really_big_file.txt') as f:
        for i, g in enumerate(grouper(n, f, fillvalue=None)):
            # dir='.' keeps the temp file on the same filesystem as the target,
            # so os.rename does not fail when the temp directory is on another mount.
            with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
                for j, line in enumerate(g, 1):  # count number of lines in group
                    if line is None:
                        j -= 1  # don't count this line
                        break
                    fout.write(line)
            os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the input has run out, so I subtract 1 from j so the filler is not counted, stop writing, and rename the file to i * n + j, the number of its real last line.
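If the temp file and rename feel heavy, the same naming can be reached another way. The sketch below swaps grouper for itertools.islice, which is a different technique from the one above: it reads up to lines_per_file lines at a time, so the real last line number is known before the output file is opened. The function name split_by_lines and the out_dir parameter are hypothetical, and the seven-row input is made up for the demo.

```python
import os
import tempfile
from itertools import islice

def split_by_lines(src_path, lines_per_file, out_dir="."):
    """Split src_path into files named after the last line each one contains."""
    written = []
    last_line = 0
    with open(src_path) as src:
        while True:
            # Pull at most lines_per_file lines from the file iterator.
            chunk = list(islice(src, lines_per_file))
            if not chunk:
                break
            last_line += len(chunk)
            out_path = os.path.join(out_dir, "small_file_{}.txt".format(last_line))
            with open(out_path, "w") as out:
                out.writelines(chunk)
            written.append(out_path)
    return written

# Demo on a throwaway 7-line file, 3 lines per output file.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "really_big_file.txt")
    with open(big, "w") as f:
        f.writelines("row {}\n".format(k) for k in range(1, 8))
    names = [os.path.basename(p) for p in split_by_lines(big, 3, out_dir=tmp)]

print(names)  # ['small_file_3.txt', 'small_file_6.txt', 'small_file_7.txt']
```

Like the grouper version, this holds only one chunk of lines in memory at a time.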


I do this in a more understandable way, with fewer shortcuts, to give you a better understanding of how and why it works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what they are doing.

Because you posted no code, I decided to do it this way: the way you phrased the question suggests you may be unfamiliar with anything beyond basic Python syntax and were unsure how to approach the problem.

Here are the steps to do this in basic Python:

First you should read your file into a list for safekeeping:

    my_file = 'really_big_file.txt'
    hold_lines = []
    with open(my_file, 'r') as text_file:
        for row in text_file:
            hold_lines.append(row)

Second, you need to set up a way of creating the new files by name. I would suggest a loop along with a couple of counters:

    outer_count = 1
    line_count = 0
    sorting = True
    while sorting:
        count = 0
        increment = (outer_count - 1) * 300
        left = len(hold_lines) - increment
        file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into a list:

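        hold_new_lines = []
        # <= rather than <, so an input whose length is an exact multiple
        # of 300 does not produce an extra empty file at the end.
        if left <= 300:
            while count < left:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
            sorting = False
        else:
            while count < 300:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1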

Last, still inside the outer loop, write the new file and increment the outer counter so that the loop runs again and writes the next file:

        outer_count += 1
        with open(file_name, 'w') as next_file:
            for row in hold_new_lines:
                next_file.write(row)

Note: if the number of lines is not divisible by 300, the name of the last file will not match the number of its last line. For example, a 1000-line input ends with small_file_1200.txt, which only holds lines 901 to 1000.
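One way around the naming mismatch, sketched here with list slicing instead of the counters used in the steps, is to compute each name from the rows actually present in the chunk. The helper name chunk_names_and_rows and the seven-row demo list are made up for illustration.

```python
def chunk_names_and_rows(hold_lines, chunk_size=300):
    """Yield (file_name, rows) pairs, naming each chunk by its real last line."""
    for start in range(0, len(hold_lines), chunk_size):
        rows = hold_lines[start:start + chunk_size]
        last_line = start + len(rows)
        yield "small_file_{}.txt".format(last_line), rows

# Seven made-up rows, three per chunk.
demo = ["row {}\n".format(k) for k in range(1, 8)]
result = [(name, len(rows)) for name, rows in chunk_names_and_rows(demo, chunk_size=3)]
print(result)
# [('small_file_3.txt', 3), ('small_file_6.txt', 3), ('small_file_7.txt', 1)]
```

Each pair can then be written out with the same open(file_name, 'w') loop as in the last step.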

It is important to understand why these loops work. The file name depends on outer_count, which changes on every pass, so each pass of the outer loop writes to a different file. This is a very useful scripting technique for accessing, opening, writing, and organizing files.

In case you could not follow what goes in which loop, here is the entire script:

    my_file = 'really_big_file.txt'
    sorting = True
    hold_lines = []

    # Read the whole file into a list.
    with open(my_file, 'r') as text_file:
        for row in text_file:
            hold_lines.append(row)

    outer_count = 1
    line_count = 0
    while sorting:
        count = 0
        increment = (outer_count - 1) * 300
        left = len(hold_lines) - increment
        file_name = "small_file_" + str(outer_count * 300) + ".txt"
        hold_new_lines = []
        # <= rather than <, so an input whose length is an exact multiple
        # of 300 does not produce an extra empty file at the end.
        if left <= 300:
            while count < left:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
            sorting = False
        else:
            while count < 300:
                hold_new_lines.append(hold_lines[line_count])
                count += 1
                line_count += 1
        outer_count += 1
        with open(file_name, 'w') as next_file:
            for row in hold_new_lines:
                next_file.write(row)