
Sort very large text file in PowerShell


Get-Content is terribly inefficient for reading large files. Sort-Object is not very fast, either.

Let's set up a baseline:

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$c = Get-Content .\log3.txt -Encoding Ascii
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$s = $c | Sort-Object;
$sw.Stop();
Write-Output ("Sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u = $s | Get-Unique
$sw.Stop();
Write-Output ("uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$u | Out-File 'result.txt' -Encoding ascii
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

With a 40 MB file of 1.6 million lines (100k unique lines, each repeated 16 times), this script produces the following output on my machine:

Reading took 00:02:16.5768663
Sorting took 00:02:04.0416976
uniq took 00:01:41.4630661
saving took 00:00:37.1630663
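(An aside, not part of the original measurements: a good chunk of Get-Content's cost comes from emitting lines down the pipeline one at a time. Passing -ReadCount 0 makes it return the whole file as a single array, which is usually markedly faster and worth checking before rewriting anything:)

$sw = [System.Diagnostics.Stopwatch]::StartNew();
# -ReadCount 0 sends all lines at once as one array instead of one line per pipeline object
$c = Get-Content .\log3.txt -Encoding Ascii -ReadCount 0
$sw.Stop();
Write-Output ("Reading took {0}" -f $sw.Elapsed);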

Either way, the baseline is totally unimpressive: more than six minutes to sort a tiny file. Every step can be improved a lot. Let's use a StreamReader to read the file line by line into a HashSet, which removes duplicates as it goes, then copy the data into a List and sort it there, then use a StreamWriter to write the results back.

$hs = new-object System.Collections.Generic.HashSet[string]
$sw = [System.Diagnostics.Stopwatch]::StartNew();
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $t = $hs.Add($line)
    }
}
finally {
    $reader.Close()
}
$sw.Stop();
Write-Output ("read-uniq took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ls = new-object system.collections.generic.List[string] $hs;
$ls.Sort();
$sw.Stop();
Write-Output ("sorting took {0}" -f $sw.Elapsed);

$sw = [System.Diagnostics.Stopwatch]::StartNew();
try
{
    $f = New-Object System.IO.StreamWriter "d:\result2.txt";
    foreach ($s in $ls)
    {
        $f.WriteLine($s);
    }
}
finally
{
    $f.Close();
}
$sw.Stop();
Write-Output ("saving took {0}" -f $sw.Elapsed);

This script produces:

read-uniq took 00:00:32.2225181
sorting took 00:00:00.2378838
saving took 00:00:01.0724802

On the same input file it runs more than 10 times faster. I am still surprised, though, that it takes 30 seconds to read the file from disk.
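A possible variation on the same idea (my sketch, not benchmarked against the above): a SortedSet[string] deduplicates and keeps its elements in sorted order as they are inserted, so the List and the explicit Sort() step disappear. Whether a tree insert per line beats HashSet.Add plus one final List.Sort is worth measuring; the paths are the same ones assumed above.

$sw = [System.Diagnostics.Stopwatch]::StartNew();
$ss = new-object System.Collections.Generic.SortedSet[string]
$reader = [System.IO.File]::OpenText("D:\log3.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null)
    {
        # Add returns $false for duplicates; discard the result
        $null = $ss.Add($line)
    }
}
finally {
    $reader.Close()
}
$f = new-object System.IO.StreamWriter "D:\result3.txt"
try {
    # SortedSet enumerates in sorted order, so no separate Sort() is needed
    foreach ($s in $ss)
    {
        $f.WriteLine($s)
    }
}
finally {
    $f.Close()
}
$sw.Stop();
Write-Output ("sortedset took {0}" -f $sw.Elapsed);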


I've grown to hate this part of Windows PowerShell; it is a memory hog on these larger files. One trick is to read the lines with [System.IO.File]::ReadLines, which streams them instead of loading the whole file:

[System.IO.File]::ReadLines('file.txt') | sort -u | out-file file2.txt -encoding ascii
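(A hedged follow-on, not from the original answer: if Out-File itself becomes the bottleneck, the write can stay in .NET as well. Note that the static File methods resolve relative paths against the process working directory rather than the current PowerShell location, so anchoring on $PWD is safer. Also, WriteAllLines writes UTF-8 without a BOM by default, unlike the -encoding ascii above.)

# Stream in, sort/dedupe, and write back without Out-File
[System.IO.File]::WriteAllLines("$PWD\file2.txt",
    [string[]]([System.IO.File]::ReadLines("$PWD\file.txt") | sort -u))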

Another trick, seriously, is to just use Linux:

cat file.txt | sort -u > output.txt

Linux is so insanely fast at this that it makes me wonder what the heck Microsoft is thinking with this setup.

It may not be feasible in all cases, I understand, but if you have a Linux machine, you can copy 500 MB to it, sort and unique it, and copy it back in under a couple of minutes.
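(For concreteness, a round trip might look like this from the Windows side; the hostname and paths are made up, and it assumes an OpenSSH client is on the PATH:)

# Copy up, sort/dedupe remotely, copy back (hypothetical host and paths)
scp .\file.txt user@linuxbox:/tmp/file.txt
ssh user@linuxbox 'sort -u /tmp/file.txt > /tmp/sorted.txt'
scp user@linuxbox:/tmp/sorted.txt .\sorted.txt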


If each line of the log is prefixed with a timestamp, and the log messages don't contain embedded newlines (which would require special handling), I think it would take less memory and execution time to convert the timestamp from [String] to [DateTime] before sorting. The following assumes each log entry is of the format yyyy-MM-dd HH:mm:ss: <Message> (note that the HH format specifier is used for a 24-hour clock):

Get-Content unsorted.txt `
    | ForEach-Object {
        # Ignore empty lines; can substitute with [String]::IsNullOrWhitespace($_) on PowerShell 3.0 and above
        if (-not [String]::IsNullOrEmpty($_))
        {
            # Split into at most two fields, even if the message itself contains ': '
            [String[]] $fields = $_ -split ': ', 2;

            return New-Object -TypeName 'PSObject' -Property @{
                Timestamp = [DateTime] $fields[0];
                Message   = $fields[1];
            };
        }
    } `
    | Sort-Object -Property 'Timestamp', 'Message';
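(One possible tweak, not from the original answer: if the [DateTime] cast proves slow or culture-sensitive, [DateTime]::ParseExact with the known format string can be swapped in for it. A standalone sketch, assuming the yyyy-MM-dd HH:mm:ss layout holds throughout:)

# Parse the timestamp field with an exact, culture-invariant format
$fields = '2017-03-01 13:05:22: Sample message' -split ': ', 2;
$timestamp = [DateTime]::ParseExact(
    $fields[0],
    'yyyy-MM-dd HH:mm:ss',
    [System.Globalization.CultureInfo]::InvariantCulture);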

If you are processing the input file for interactive display purposes, you can pipe the sorting pipeline above into Out-GridView or Format-Table to view the results. If you need to save the sorted results, you can pipe it into the following:

    | ForEach-Object {
        # Reconstruct the log entry format of the input file
        return '{0:yyyy-MM-dd HH:mm:ss}: {1}' -f $_.Timestamp, $_.Message;
    } `
    | Out-File -Encoding 'UTF8' -FilePath 'sorted.txt';