Speed of PowerShell script - optimisation sought
Get-Content is extremely slow in the default mode that produces an array when the file contains millions of lines, on all PowerShell versions including 5.1. What's worse, assigning the result to a variable means nothing else happens until the entire file has been read and split into lines. On an Intel i7 3770K CPU at 3.9GHz,

```powershell
$csv = Get-Content $path
```

takes more than 2 minutes to read a 350MB file with 8 million lines.

Solution: use `IO.StreamReader` to read a line and process it immediately.
In PowerShell 2, StreamReader is less optimized than in PS3+, but it is still faster than Get-Content.
- Pipelining via `|` is at least several times slower than direct enumeration via flow-control statements such as the `while` or `foreach` statement (not the cmdlet). Solution: use the statements.
- Splitting each line into an array of strings is slower than manipulating only one string. Solution: use the `IndexOf` and `Replace` methods (not the operators) to count character occurrences.
- PowerShell always creates an internal pipeline when loops are used. Solution: use the `Invoke-Command { }` trick for a 2-3x speedup in this case!
Below is PS2-compatible code. It runs faster in PS3+ (30 seconds for 8 million lines in a 350MB CSV on my PC):
```powershell
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count

$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)

Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0

Invoke-Command { # the speed-up trick: disables internal pipeline
    while (!$reader.EndOfStream) {
        $s = $reader.ReadLine()
        $slen = $s.length
        if ($slen - $s.IndexOf(',') - 1 -ge 40 -and
            $slen - $s.Replace(',','').length + 1 -eq $numCol) {
            $writer1.WriteLine($s)
        } else {
            $writer2.WriteLine($s)
        }
        if (++$currLine % 10000 -eq 0) {
            $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
            Write-Progress 'Filtering...' -status "Line: $currLine" `
                -PercentComplete ($pctDone * 100) `
                -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
        }
    }
} # Invoke-Command end

Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
```
Another approach is to use a regex in two passes (it's slower than the above code, though). PowerShell 3 or newer is required due to the array element property shorthand syntax:
```powershell
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count

$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" + ($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
```
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
```shell
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}      # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next}                      # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next}   # Fail for wrong length
{print > "Pass.txt"}                                  # Pass
' MillionRecord.csv
```
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called `commands.awk`:
```awk
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
```
Then run with:
```shell
awk -F, -f commands.awk Your.csv
```
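Before running against the full file, you can sanity-check the field-count and length rules on a tiny handmade sample. This is just an illustrative sketch: the file name `sample.csv` and its contents are made up, and the awk program is inlined rather than read from `commands.awk`:

```shell
# Build a 3-column sample: one good row, one with a missing field, one too short.
printf 'id,name,payload\n' > sample.csv
printf 'x,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,bb\n' >> sample.csv  # rest-of-line >= 40 chars -> Pass
printf '1,2\n' >> sample.csv                                            # wrong field count -> Fail
printf 'a,b,c\n' >> sample.csv                                          # too short -> Fail

# Same rules as commands.awk, with the header row simply skipped.
awk -F, '
NR==1 {f=NF; next}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
' sample.csv

wc -l Pass.txt Fail.txt
```

One row should land in `Pass.txt` and two in `Fail.txt`, confirming the rules do what you expect before you point them at millions of records.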
The remainder of this answer relates to a "Beat Hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over the 3.5GB in 1,543 files, amounting to around 104 million records:
```bash
#!/bin/bash
doit(){
  awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' $@
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
```
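If you want to see what the awk program inside `doit` reports without downloading the full dataset, here is a tiny fabricated sample (the file `sample.pgn` and its contents are invented for illustration). The two numbers printed are the white-win and black-win counts:

```shell
# Fabricate a minimal PGN-like file: two white wins, one black win, one draw.
cat > sample.pgn <<'EOF'
[Result "1-0"]
[Result "0-1"]
[Result "1-0"]
[Result "1/2-1/2"]
EOF

# Same filter as in doit(): skip non-[Result lines, tally 1-0 as white, 0-1 as black.
awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' sample.pgn
# → 2 1
```

The draw line matches neither `1-0` nor `0-1`, so it is counted in neither tally; `parallel` then just fans this same program out over the 1,543 real files.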
Try experimenting with different looping strategies; for example, switching to a for loop cuts the processing time by more than 50%:
```powershell
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;

try {
    # Load.
    $objMeasure = Measure-Command {
        $csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
        $intHeaderCount = ($csv[0] -split ',').count;
    } #measure-command
    'Load took {0}ms' -f $objMeasure.TotalMilliseconds;

    # Create stream writers.
    try {
        $objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
        $objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );

        # Process CSV (v1).
        $objMeasure = Measure-Command {
            $csv | Select-Object -Skip 1 | Foreach-Object {
                if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $_ );
                } else {
                    $objFailStream.WriteLine( $_ );
                } #else-if
            } #foreach-object
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        # Process CSV (v2).
        $objMeasure = Measure-Command {
            for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
                if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $csv[$intRow] );
                } else {
                    $objFailStream.WriteLine( $csv[$intRow] );
                } #else-if
            } #for
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
    } #try
    catch [System.Exception] {
        'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
    } #catch
    finally {
        $objFailStream.close();
        $objPassStream.close();
    } #finally
} #try
catch [System.Exception] {
    'ERROR : Failed to load CSV.';
} #catch

exit 0;
```