Why is PowerShell so slow (much slower than Python) at large search/replace operations?
Give this PowerShell script a try. It should perform much better, and use far less RAM too, since the file is read through a buffered stream.
$reader = [IO.File]::OpenText("C:\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\output.csv")
while ($reader.Peek() -ge 0) {
    $line  = $reader.ReadLine()
    $line2 = $line -replace $SearchStr, $ReplaceStr
    $writer.WriteLine($line2)
}
$reader.Close()
$writer.Close()
This processes one file, but you can test performance with it, and if it's more acceptable, add it to a loop.
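Wrapping the streaming version above in a loop over a folder of CSVs might look like the sketch below. The function name `Convert-CsvFolder` and the folder layout are my own invention, not part of the original answer; it writes to a temp file and then replaces the original.

```powershell
function Convert-CsvFolder {
    param([string]$Folder, [string]$SearchStr, [string]$ReplaceStr)

    foreach ($file in Get-ChildItem (Join-Path $Folder '*.csv')) {
        # Stream line-by-line into a temp file to keep memory use low.
        $tmp    = "$($file.FullName).tmp"
        $reader = [IO.File]::OpenText($file.FullName)
        $writer = New-Object System.IO.StreamWriter($tmp)
        while ($reader.Peek() -ge 0) {
            $writer.WriteLine(($reader.ReadLine() -replace $SearchStr, $ReplaceStr))
        }
        $reader.Close()
        $writer.Close()
        # Swap the rewritten file in place of the original.
        Move-Item $tmp $file.FullName -Force
    }
}
```

Rewriting in place via a temp file avoids holding any file fully in memory, at the cost of briefly needing extra disk space per file.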
Alternatively, you can use Get-Content to read a number of lines into memory, perform the replacement, and then write the updated chunk using the PowerShell pipeline.
Get-Content "C:\input.csv" -ReadCount 512 | % { $_ -replace $SearchStr, $ReplaceStr} | Set-Content "C:\output.csv"
To squeeze out a little more performance, you can also compile the regex (-replace uses regular expressions) like this:
$re = New-Object Regex $SearchStr, 'Compiled'
$re.Replace($_, $ReplaceStr)
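Combining the compiled regex with the chunked ReadCount pipeline could look like the sketch below. The function name `Invoke-CompiledReplace` is hypothetical, not from the original answers; the chunk size of 512 mirrors the earlier example.

```powershell
function Invoke-CompiledReplace {
    param([string]$InPath, [string]$OutPath, [string]$SearchStr, [string]$ReplaceStr)

    # Compile the pattern once; RegexOptions.Compiled trades a little startup
    # cost for faster matching across many lines.
    $re = New-Object Regex $SearchStr, 'Compiled'

    Get-Content $InPath -ReadCount 512 |
        ForEach-Object {
            # $_ is a chunk (array) of up to 512 lines; Regex.Replace takes a
            # single string, so apply it per line.
            foreach ($line in $_) { $re.Replace($line, $ReplaceStr) }
        } |
        Set-Content $OutPath
}
```

This keeps memory bounded by the chunk size while avoiding recompiling the pattern on every line.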
I see this a lot:
$content | foreach {$_ -replace $SearchStr, $ReplaceStr}
The -replace operator will handle an entire array at once:
$content -replace $SearchStr, $ReplaceStr
and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.
I don't know Python, but it looks like you are doing literal string replacements in the Python script. In PowerShell, the -replace operator does a regular-expression search and replace. I would convert the PowerShell to use the Replace method on the string class (or, to answer the original question: I think your PowerShell is inefficient).
ForEach ($file in Get-ChildItem C:\temp\csv\*.csv) {
    $content = Get-Content -Path $file
    # look close, not much changes
    $content | foreach { $_.Replace($SearchStr, $ReplaceStr) } | Set-Content $file
}
EDIT: Upon further review, I think I see another (perhaps more important) difference between the versions. The Python version appears to read the entire file into a single string, while the PowerShell version reads it into an array of strings.
The help for Get-Content mentions a ReadCount parameter that can affect performance. Setting this count to -1 seems to read the entire file into a single array. This means you are passing an array through the pipeline instead of individual strings, but a simple change to the code deals with that:
# $content is now an array
$content | % { $_ } | % { $_.Replace($SearchStr, $ReplaceStr) } | Set-Content $file
If you want to read the entire file into a single string like the Python version seems to, just call the .NET method directly:
# now you have to make sure to use a FULL RESOLVED PATH
$content = [System.IO.File]::ReadAllText($file.FullName)
$content.Replace($SearchStr, $ReplaceStr) | Set-Content $file
This is not quite as "PowerShell-y", since you use the .NET APIs directly instead of the corresponding cmdlets, but the ability is there for the times when you need it.
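For symmetry, the write could go through the .NET API as well as the read. A small sketch, wrapped in a hypothetical helper (`Convert-WholeFile` is my name, not from the thread); note that String.Replace is a literal replacement, which matches the described Python behavior.

```powershell
function Convert-WholeFile {
    param([string]$InPath, [string]$OutPath, [string]$SearchStr, [string]$ReplaceStr)

    # Read the whole file as ONE string, do a literal (non-regex) replace,
    # and write the result back in a single call.
    $content = [System.IO.File]::ReadAllText($InPath)
    [System.IO.File]::WriteAllText($OutPath, $content.Replace($SearchStr, $ReplaceStr))
}
```

Using WriteAllText instead of Set-Content also avoids the pipeline overhead entirely and preserves the original line endings of the content string.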