How to split large text file in windows?


If you have installed Git for Windows, you should have Git Bash installed, since that comes with Git.

Use the split command in Git Bash to split a file:

  • into files of size 500MB each: split myLargeFile.txt -b 500m

  • into files with 10000 lines each: split myLargeFile.txt -l 10000
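To see what the line-based split produces without touching a real file, here is a minimal sketch that can be pasted into Git Bash. It assumes GNU coreutils (which Git Bash ships); the scratch directory and the 25,000-line stand-in file are made up for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)                        # throwaway scratch directory
seq 1 25000 > "$workdir/myLargeFile.txt"    # 25,000-line stand-in for a large file

# Split into pieces of 10,000 lines each (pieces land in the current directory)
(cd "$workdir" && split -l 10000 myLargeFile.txt)

pieces=$(cd "$workdir" && ls x?? | wc -l)   # how many pieces were written
last=$(wc -l < "$workdir/xac")              # lines in the final, shorter piece
total=$(cat "$workdir"/x?? | wc -l)         # lines across all pieces combined

echo "pieces=$((pieces)) last=$((last)) total=$((total))"
# prints: pieces=3 last=5000 total=25000
```

The options are placed before the file name here only because that order is also accepted by non-GNU versions of split; in Git Bash either order works. Concatenating the pieces in name order (cat x??) gives back the original content.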

Tips:

  • If you don't have Git/Git Bash, download at https://git-scm.com/download

  • If you lost the shortcut to Git Bash, you can run it using C:\Program Files\Git\git-bash.exe

That's it!


I always like examples though...

Example:

In this example, the files generated by split are named xaa, xab, xac, etc.

These names are made up of a prefix and a suffix, which you can specify. Since I didn't specify what I want the prefix or suffix to look like, the prefix defaulted to x, and the suffix defaulted to a two-character alphabetical enumeration.
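A small sketch of the naming rule, assuming Git Bash or any shell with coreutils. The six-line input.txt and the part_ prefix are hypothetical names picked for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)              # throwaway scratch directory
seq 1 6 > "$workdir/input.txt"    # six-line stand-in file

# No prefix given: pieces are named x plus a two-letter suffix
(cd "$workdir" && split -l 2 input.txt)
default_names=$(cd "$workdir" && echo x??)

# Prefix given as the last argument: pieces are named part_ plus the suffix
(cd "$workdir" && split -l 2 input.txt part_)
prefixed_names=$(cd "$workdir" && echo part_??)

echo "$default_names"     # xaa xab xac
echo "$prefixed_names"    # part_aa part_ab part_ac
```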

Another Example:

This example demonstrates

  • using a filename prefix of MySlice (instead of the default x),
  • the -d flag for using numerical suffixes (instead of aa, ab, ac, etc...),
  • and the option -a 5 to tell it I want the suffixes to be 5 digits long.
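Put together, a command with those three options might look like the sketch below. It assumes GNU split (the -d flag is not available in every split implementation); the six-line stand-in file and the -l 2 chunk size are chosen only to keep the demo small.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)                    # throwaway scratch directory
seq 1 6 > "$workdir/myLargeFile.txt"    # six-line stand-in for a large file

# -d: numeric suffixes, -a 5: five-character suffixes, MySlice: filename prefix
(cd "$workdir" && split -d -a 5 -l 2 myLargeFile.txt MySlice)

names=$(cd "$workdir" && echo MySlice*)
echo "$names"    # MySlice00000 MySlice00001 MySlice00002
```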


Set Arg = WScript.Arguments
set WshShell = createObject("Wscript.Shell")
Set Inp = WScript.Stdin
Set Outp = Wscript.Stdout
    Set rs = CreateObject("ADODB.Recordset")
    With rs
        .Fields.Append "LineNumber", 4
        .Fields.Append "Txt", 201, 5000
        .Open
        LineCount = 0
        Do Until Inp.AtEndOfStream
            LineCount = LineCount + 1
            .AddNew
            .Fields("LineNumber").value = LineCount
            .Fields("Txt").value = Inp.readline
            .UpDate
        Loop
        .Sort = "LineNumber ASC"
        If LCase(Arg(1)) = "t" then
            If LCase(Arg(2)) = "i" then
                .filter = "LineNumber < " & LCase(Arg(3)) + 1
            ElseIf LCase(Arg(2)) = "x" then
                .filter = "LineNumber > " & LCase(Arg(3))
            End If
        ElseIf LCase(Arg(1)) = "b" then
            If LCase(Arg(2)) = "i" then
                .filter = "LineNumber > " & LineCount - LCase(Arg(3))
            ElseIf LCase(Arg(2)) = "x" then
                .filter = "LineNumber < " & LineCount - LCase(Arg(3)) + 1
            End If
        End If
        Do While not .EOF
            Outp.writeline .Fields("Txt").Value
            .MoveNext
        Loop
    End With

Cut

filter cut {t|b} {i|x} NumOfLines

Cuts the number of lines from the top or bottom of file.

t - top of the file
b - bottom of the file
i - include n lines
x - exclude n lines

Example

cscript /nologo filter.vbs cut t i 5 < "%systemroot%\win.ini"
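For readers who also have Git Bash available, the four cut modes map roughly onto head and tail. This is a different tool from the VBScript filter, shown only as a sketch for comparison: the ten-line sample file and the one_line helper are hypothetical, and the negative count for head is a GNU extension.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)                # throwaway scratch directory
sample="$workdir/sample.txt"        # hypothetical ten-line input
seq 1 10 > "$sample"

one_line() { paste -s -d ' ' -; }   # join output lines with spaces for display

top_include=$(head -n 3 "$sample" | one_line)       # like: cut t i 3
top_exclude=$(tail -n +4 "$sample" | one_line)      # like: cut t x 3
bottom_include=$(tail -n 3 "$sample" | one_line)    # like: cut b i 3
bottom_exclude=$(head -n -3 "$sample" | one_line)   # like: cut b x 3 (GNU head only)

echo "t i 3: $top_include"
echo "t x 3: $top_exclude"
echo "b i 3: $bottom_include"
echo "b x 3: $bottom_exclude"
```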

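Another way: this outputs lines 5001 onward; adapt for your use. It uses almost no memory.

Set Inp = WScript.Stdin
Set OutP = WScript.Stdout
Count = 0
Do Until Inp.AtEndOfStream
    Line = Inp.ReadLine        ' always consume the line so the loop advances
    Count = Count + 1
    If Count > 5000 Then
        OutP.WriteLine Line
    End If
Loop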


Of course there is! Win CMD can do a lot more than just split text files :)

Split a text file into separate files of 'max' lines each:

:: Run at a prompt started with cmd /v:on (the !var! syntax needs delayed expansion).
:: In a .bat file, write %%i instead of %i.
:: Initialize
set input=file.txt
set max=10000
set /a line=1 >nul
set /a file=1 >nul
set out=!file!_%input%
set /a max+=1 >nul
echo Number of lines in %input%:
find /c /v "" < %input%
:: Split file
for /f "tokens=* delims=[" %i in ('type "%input%" ^| find /v /n ""') do (
if !line!==%max% (
set /a line=1 >nul
set /a file+=1 >nul
set out=!file!_%input%
echo Writing file: !out!
)
REM Write next file
set a=%i
set a=!a:*]=]!
echo:!a:~1!>>!out!
set /a line+=1 >nul
)
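For comparison, the same idea (count lines, move to the next numbered output file every max lines) can be sketched with awk in Git Bash. This is not the CMD method above, just an illustration of the logic; it assumes an awk is on the PATH, and the seven-line file.txt with max=3 is a made-up stand-in.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)              # throwaway scratch directory
seq 1 7 > "$workdir/file.txt"     # seven-line stand-in for file.txt

(
  cd "$workdir"
  awk -v max=3 '
    (NR - 1) % max == 0 {           # first line of a new piece
      if (out != "") close(out)     # done with the previous piece
      out = (++file) "_" FILENAME   # 1_file.txt, 2_file.txt, ...
    }
    { print > out }                 # append the line to the current piece
  ' file.txt
)

names=$(cd "$workdir" && echo [0-9]_file.txt)
last=$(cat "$workdir/3_file.txt")   # content of the final, shorter piece
echo "$names"    # 1_file.txt 2_file.txt 3_file.txt
```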

If the above code hangs or crashes, this example code splits files faster (by writing data to intermediate files instead of keeping everything in memory):

E.g., to split a file with 7,600 lines into smaller files of at most 3,000 lines:

  1. Generate regexp string/pattern files with set command to be fed to /g flag of findstr

list1.txt

\[[0-9]\]
\[[0-9][0-9]\]
\[[0-9][0-9][0-9]\]
\[[0-2][0-9][0-9][0-9]\]

list2.txt

\[[3-5][0-9][0-9][0-9]\]

list3.txt

\[[6-9][0-9][0-9][0-9]\]

  2. Split the file into smaller files:
type "%input%" | find /v /n "" | findstr /b /r /g:list1.txt > file1.txt
type "%input%" | find /v /n "" | findstr /b /r /g:list2.txt > file2.txt
type "%input%" | find /v /n "" | findstr /b /r /g:list3.txt > file3.txt
  3. Remove the prefixed line numbers from each split file, e.g. for the 1st file:
for /f "tokens=* delims=[" %i in ('type "%cd%\file1.txt"') do (
set a=%i
set a=!a:*]=]!
echo:!a:~1!>>file_1.txt)
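As a sanity check on which line numbers each pattern file selects, here is a rough sketch using grep in Git Bash instead of findstr. The patterns are grep adaptations of list1.txt, list2.txt and list3.txt (an assumption on my part, not the findstr files themselves), and the numbered input is generated rather than taken from a real file.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)   # throwaway scratch directory

# grep versions of the three pattern files: ^ stands in for findstr /b,
# and the closing ] is left unescaped because grep treats it literally there
cat > "$workdir/list1.txt" <<'EOF'
^\[[0-9]]
^\[[0-9][0-9]]
^\[[0-9][0-9][0-9]]
^\[[0-2][0-9][0-9][0-9]]
EOF
printf '%s\n' '^\[[3-5][0-9][0-9][0-9]]' > "$workdir/list2.txt"
printf '%s\n' '^\[[6-9][0-9][0-9][0-9]]' > "$workdir/list3.txt"

# Imitate the [N] prefixes that find /v /n "" adds, for 7,600 lines
seq 1 7600 | sed 's/.*/[&]some text/' > "$workdir/numbered.txt"

count1=$(grep -c -f "$workdir/list1.txt" "$workdir/numbered.txt")
count2=$(grep -c -f "$workdir/list2.txt" "$workdir/numbered.txt")
count3=$(grep -c -f "$workdir/list3.txt" "$workdir/numbered.txt")

echo "$count1 $count2 $count3"    # 2999 3000 1601
```

The first piece ends up with 2,999 lines rather than 3,000 because line numbering starts at 1 while the first pattern set covers 0 to 2999; all three pieces still stay within the 3,000-line maximum.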

Notes:
Works with leading whitespace, blank lines & whitespace lines.

Tested on Win 10 x64 CMD, on 4.4GB text file, 5651982 lines.