Speed of PowerShell script - optimisation sought
Get-Content is extremely slow in the default mode that produces an array when the file contains millions of lines, on all PowerShell versions including 5.1. What's worse, assigning the result to a variable means nothing else happens until the entire file has been read and split into lines. On an Intel i7 3770K CPU at 3.9GHz,

```powershell
$csv = Get-Content $path
```

takes more than 2 minutes to read a 350MB file with 8 million lines.

Solution: use `IO.StreamReader` to read a line and process it immediately.
In PowerShell 2, StreamReader is less optimized than in PS3+, but it is still faster than Get-Content.
- Pipelining via `|` is at least several times slower than direct enumeration via flow-control statements such as the `while` or `foreach` statement (not the cmdlet). Solution: use the statements.
- Splitting each line into an array of strings is slower than manipulating only one string. Solution: use the `IndexOf` and `Replace` methods (not the operators) to count character occurrences.
- PowerShell always creates an internal pipeline when loops are used. Solution: use the `Invoke-Command { }` trick for a 2-3x speedup in this case!
Below is PS2-compatible code. It runs faster in PS3+ (30 seconds for 8 million lines in a 350MB CSV on my PC):
```powershell
$reader = New-Object IO.StreamReader ('r:\data.csv', [Text.Encoding]::UTF8, $true, 4MB)
$header = $reader.ReadLine()
$numCol = $header.Split(',').count

$writer1 = New-Object IO.StreamWriter ('r:\1.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer2 = New-Object IO.StreamWriter ('r:\2.csv', $false, [Text.Encoding]::UTF8, 4MB)
$writer1.WriteLine($header)
$writer2.WriteLine($header)

Write-Progress 'Filtering...' -status ' '
$watch = [Diagnostics.Stopwatch]::StartNew()
$currLine = 0

Invoke-Command { # the speed-up trick: disables internal pipeline
    while (!$reader.EndOfStream) {
        $s = $reader.ReadLine()
        $slen = $s.length
        if ($slen - $s.IndexOf(',') - 1 -ge 40 -and
            $slen - $s.Replace(',','').length + 1 -eq $numCol) {
            $writer1.WriteLine($s)
        } else {
            $writer2.WriteLine($s)
        }
        if (++$currLine % 10000 -eq 0) {
            $pctDone = $reader.BaseStream.Position / $reader.BaseStream.Length
            Write-Progress 'Filtering...' -status "Line: $currLine" `
                -PercentComplete ($pctDone * 100) `
                -SecondsRemaining ($watch.ElapsedMilliseconds * (1/$pctDone - 1) / 1000)
        }
    }
} # Invoke-Command end

Write-Progress 'Filtering...' -Completed -status ' '
echo "Elapsed $($watch.Elapsed)"
$reader.close()
$writer1.close()
$writer2.close()
```
Another approach is to use a regex in two passes (it's slower than the above code, though). PowerShell 3 or newer is required due to the array element property shorthand syntax:
```powershell
$text = [IO.File]::ReadAllText('r:\data.csv')
$header = $text.substring(0, $text.indexOfAny("`r`n"))
$numCol = $header.split(',').count

$rx = [regex]"\r?\n(?:[^,]*,){$($numCol-1)}[^,]*?(?=\r?\n|$)"
[IO.File]::WriteAllText('r:\1.csv', $header + "`r`n" + ($rx.matches($text).groups.value -join "`r`n"))
[IO.File]::WriteAllText('r:\2.csv', $header + "`r`n" + $rx.replace($text, ''))
```
If you feel like installing awk, you can do 1,000,000 records in under a second - seems like a good optimisation to me :-)
```shell
awk -F, '
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}      # First record, get expected number of fields
NF!=f {print > "Fail.txt"; next}                      # Fail for wrong field count
length($0)-length($1)<40 {print > "Fail.txt"; next}   # Fail for wrong length
{print > "Pass.txt"}                                  # Pass
' MillionRecord.csv
```
You can get gawk for Windows from here.
Windows is a bit awkward with single quotes in parameters, so if running under Windows I would use the same code, but formatted like this:
Save this in a file called `commands.awk`:
```awk
NR==1 {f=NF; printf("Expecting: %d fields\n",f)}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
```
Then run with:
```shell
awk -F, -f commands.awk Your.csv
```
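Before running against the full file, you can sanity-check the field-count and length rules on a tiny handmade sample. This is just an illustrative sketch: the file name `sample.csv` and its contents are made up, and the awk program is inlined rather than read from `commands.awk`:

```shell
# Build a 3-column sample: one good row, one with a missing field, one too short.
printf 'id,name,payload\n' > sample.csv
printf 'x,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,bb\n' >> sample.csv  # rest-of-line >= 40 chars -> Pass
printf '1,2\n' >> sample.csv                                            # wrong field count -> Fail
printf 'a,b,c\n' >> sample.csv                                          # too short -> Fail

# Same rules as commands.awk, with the header row simply skipped.
awk -F, '
NR==1 {f=NF; next}
NF!=f {print > "Fail.txt"; next}
length($0)-length($1)<40 {print > "Fail.txt"; next}
{print > "Pass.txt"}
' sample.csv

wc -l Pass.txt Fail.txt
```

One row should land in `Pass.txt` and two in `Fail.txt`, confirming the rules do what you expect before you point them at millions of records.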
The remainder of this answer relates to a "Beat Hadoop with the shell" challenge mentioned in the comments section, and I wanted somewhere to save my code, so it's here. It runs in 6.002 seconds on my iMac over the 3.5GB in 1,543 files, amounting to around 104 million records:
```bash
#!/bin/bash
doit(){
  awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' $@
}
export -f doit
find . -name \*.pgn -print0 | parallel -0 -n 4 -j 12 doit {}
```
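If you want to see what the awk program inside `doit` reports without downloading the full dataset, here is a tiny fabricated sample (the file `sample.pgn` and its contents are invented for illustration). The two numbers printed are the white-win and black-win counts:

```shell
# Fabricate a minimal PGN-like file: two white wins, one black win, one draw.
cat > sample.pgn <<'EOF'
[Result "1-0"]
[Result "0-1"]
[Result "1-0"]
[Result "1/2-1/2"]
EOF

# Same filter as in doit(): skip non-[Result lines, tally 1-0 as white, 0-1 as black.
awk '!/^\[Result/{next} /1-0/{w++;next} /0-1/{b++} END{print w,b}' sample.pgn
# → 2 1
```

The draw line matches neither `1-0` nor `0-1`, so it is counted in neither tally; `parallel` then just fans this same program out over the 1,543 real files.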
Try experimenting with different looping strategies; for example, switching to a for loop cuts the processing time by more than 50%:
```powershell
[String] $Local:file = 'Your.csv';
[String] $Local:path = 'C:\temp';
[System.Array] $Local:csv = $null;
[System.IO.StreamWriter] $Local:objPassStream = $null;
[System.IO.StreamWriter] $Local:objFailStream = $null;
[Int32] $Local:intHeaderCount = 0;
[Int32] $Local:intRow = 0;
[String] $Local:strRow = '';
[TimeSpan] $Local:objMeasure = 0;

try {
    # Load.
    $objMeasure = Measure-Command {
        $csv = Get-Content -LiteralPath (Join-Path -Path $path -ChildPath $file) -ErrorAction Stop;
        $intHeaderCount = ($csv[0] -split ',').count;
    } #measure-command
    'Load took {0}ms' -f $objMeasure.TotalMilliseconds;

    # Create stream writers.
    try {
        $objPassStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Passed{1}-pass.txt' -f $path, $file );
        $objFailStream = New-Object -TypeName System.IO.StreamWriter ( '{0}\Failed{1}-fail.txt' -f $path, $file );

        # Process CSV (v1).
        $objMeasure = Measure-Command {
            $csv | Select-Object -Skip 1 | Foreach-Object {
                if( (($_ -Split ',').Count -ge $intHeaderCount) -And (($_.Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $_ );
                } else {
                    $objFailStream.WriteLine( $_ );
                } #else-if
            } #foreach-object
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;

        # Process CSV (v2).
        $objMeasure = Measure-Command {
            for ( $intRow = 1; $intRow -lt $csv.Count; $intRow++ ) {
                if( (($csv[$intRow] -Split ',').Count -ge $intHeaderCount) -And (($csv[$intRow].Split(',',2)[1]).Length -ge 40) ) {
                    $objPassStream.WriteLine( $csv[$intRow] );
                } else {
                    $objFailStream.WriteLine( $csv[$intRow] );
                } #else-if
            } #for
        } #measure-command
        'Process took {0}ms' -f $objMeasure.TotalMilliseconds;
    } #try
    catch [System.Exception] {
        'ERROR : Failed to create stream writers; exception was "{0}"' -f $_.Exception.Message;
    } #catch
    finally {
        $objFailStream.close();
        $objPassStream.close();
    } #finally
} #try
catch [System.Exception] {
    'ERROR : Failed to load CSV.';
} #catch

exit 0;
```