PowerShell is slow (much slower than Python) in large Search/Replace operation?


Give this PowerShell script a try. It should perform much better. It also uses much less RAM, because the file is read through a buffered stream instead of being loaded all at once.

$reader = [IO.File]::OpenText("C:\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\output.csv")
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $line2 = $line -replace $SearchStr, $ReplaceStr
    $writer.writeline($line2)
}
$reader.Close()
$writer.Close()

This processes one file, but you can test performance with it and, if it's more acceptable, wrap it in a loop over your files.

Alternatively you can use Get-Content to read a number of lines into memory, perform the replacement and then write the updated chunk utilizing the PowerShell pipeline.

Get-Content "C:\input.csv" -ReadCount 512 | % {
    $_ -replace $SearchStr, $ReplaceStr
} | Set-Content "C:\output.csv"

To squeeze a little more performance you can also compile the regex (-replace uses regular expressions) like this:

$re = New-Object Regex $SearchStr, 'Compiled'
$re.Replace( $_ , $ReplaceStr )
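
The fragment above uses `$_`, so it only makes sense inside a loop or pipeline. Here is a minimal sketch of how it might be combined with the StreamReader loop from the first snippet. The function name `Convert-FileText` and its parameters are made up for illustration, and the try/finally is my addition so the files get closed if something throws:

```powershell
# Hypothetical helper (not from the original answer): streams a file line by
# line and applies a regex that is compiled once, outside the loop.
function Convert-FileText {
    param(
        [string]$InputPath,    # full path of the file to read
        [string]$OutputPath,   # full path of the file to write
        [string]$SearchStr,    # regular expression pattern
        [string]$ReplaceStr    # replacement text (regex substitution syntax applies)
    )

    $options = [System.Text.RegularExpressions.RegexOptions]::Compiled
    $re = New-Object System.Text.RegularExpressions.Regex -ArgumentList $SearchStr, $options

    $reader = $null
    $writer = $null
    try {
        $reader = [System.IO.File]::OpenText($InputPath)
        $writer = New-Object System.IO.StreamWriter -ArgumentList $OutputPath

        # ReadLine() returns $null at end of file; an empty line is still a string.
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine($re.Replace($line, $ReplaceStr))
        }
    }
    finally {
        if ($reader) { $reader.Close() }
        if ($writer) { $writer.Close() }
    }
}
```

You would call it as something like `Convert-FileText -InputPath C:\input.csv -OutputPath C:\output.csv -SearchStr $SearchStr -ReplaceStr $ReplaceStr`. Whether compiling pays off depends on how many lines you push through it, so measure before assuming it helps.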


I see this a lot:

$content | foreach {$_ -replace $SearchStr, $ReplaceStr} 

The -replace operator will handle an entire array at once:

$content -replace $SearchStr, $ReplaceStr

and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.
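
A small sketch to show that the two forms give the same result. The sample array and the search/replace values are placeholders I made up; in the real script `$content` would come from Get-Content:

```powershell
# Stand-in for the output of Get-Content (hypothetical sample data).
$content    = @('a,b', 'b,c', 'c,d')
$SearchStr  = 'b'
$ReplaceStr = 'X'

# One pipeline iteration per line.
$perElement = $content | ForEach-Object { $_ -replace $SearchStr, $ReplaceStr }

# The operator applied to the whole array in one expression.
$wholeArray = $content -replace $SearchStr, $ReplaceStr

# Both variables now hold: 'a,X', 'X,c', 'c,d'
```

To see the speed difference on your own data, wrap each form in `Measure-Command { ... }` and compare the totals.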


I don't know Python, but it looks like you are doing literal string replacements in the Python script. In PowerShell, the -replace operator performs a regular-expression search and replace. I would convert the PowerShell script to use the Replace method of the string class (or, to answer the original question: I think your PowerShell is inefficient).

ForEach ($file in Get-ChildItem C:\temp\csv\*.csv) {
    $content = Get-Content -path $file
    # look close, not much changes
    $content | foreach {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
}
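
To make the regex-versus-literal point concrete, here is a tiny sketch with a made-up input string. It shows that the two approaches are not interchangeable when the search text contains regex metacharacters:

```powershell
$line = 'a.b'   # hypothetical sample; '.' is a regex metacharacter

# -replace treats '.' as "any character", so every character is replaced.
$viaRegex   = $line -replace '.', '-'                      # '---'

# String.Replace treats '.' as a literal dot.
$viaLiteral = $line.Replace('.', '-')                      # 'a-b'

# If you want to stay with -replace, escape the search text first.
$viaEscaped = $line -replace ([regex]::Escape('.')), '-'   # 'a-b'
```

One more difference to keep in mind: -replace is case-insensitive by default, while the Replace method on the string class is case-sensitive.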

EDIT: Upon further review, I think I see another (perhaps more important) difference between the versions. The Python version appears to read the entire file into a single string. The PowerShell version, on the other hand, reads the file into an array of strings, one per line.

The help on Get-Content mentions a ReadCount parameter that can affect the performance. Setting this count to -1 seems to read the entire file into a single array. This will mean that you are passing an array through the pipeline instead of individual strings, but a simple change to the code will deal with that:

# $content is now an array
$content | % { $_ } | % {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
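
If you want to see what ReadCount actually changes, this sketch counts how many objects Get-Content sends down the pipeline for different values. The temp file and the helper name `Get-ChunkCount` are made up for the demo:

```powershell
# Hypothetical demo file with five lines.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value 'l1', 'l2', 'l3', 'l4', 'l5'

# Hypothetical helper: counts the objects Get-Content emits for a given -ReadCount.
function Get-ChunkCount {
    param([string]$Path, [int]$ReadCount)
    (Get-Content -LiteralPath $Path -ReadCount $ReadCount |
        ForEach-Object { 1 } |
        Measure-Object).Count
}

$perLine   = Get-ChunkCount -Path $tmp -ReadCount 1   # 5: one string per line (the default)
$inPairs   = Get-ChunkCount -Path $tmp -ReadCount 2   # 3: arrays of 2, 2 and 1 lines
$allAtOnce = Get-ChunkCount -Path $tmp -ReadCount 0   # 1: a single array with every line

Remove-Item -LiteralPath $tmp
```

Fewer, larger chunks mean fewer trips through the pipeline, which is where most of the per-line overhead comes from. The Get-Content help lists 0 and negative values as both meaning "send everything at once".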

If you want to read the entire file into a single string like the Python version seems to, just call the .NET method directly:

# now you have to make sure to use a FULL RESOLVED PATH
$content = [System.IO.File]::ReadAllText($file.FullName)
$content.Replace($SearchStr, $ReplaceStr) | Set-Content $file

This is not quite as "PowerShell-y", since you use the .NET APIs directly instead of the similar cmdlets, but they put the ability in there for times when you need it.
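
For completeness, a sketch that stays on the .NET side for both the read and the write. The function name `Update-FileText` is hypothetical. The Resolve-Path step is there because .NET resolves relative paths against the process working directory, which is not necessarily PowerShell's current location:

```powershell
# Hypothetical helper: whole-file literal replace using only .NET file APIs.
function Update-FileText {
    param(
        [string]$Path,         # file to rewrite in place
        [string]$SearchStr,    # literal text to find (case-sensitive)
        [string]$ReplaceStr    # literal replacement text
    )

    # Turn a possibly relative PowerShell path into a full file system path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read everything into one string, replace, and write it straight back.
    $content = [System.IO.File]::ReadAllText($fullPath)
    [System.IO.File]::WriteAllText($fullPath, $content.Replace($SearchStr, $ReplaceStr))
}
```

Something like `Get-ChildItem C:\temp\csv\*.csv | ForEach-Object { Update-FileText -Path $_.FullName -SearchStr $SearchStr -ReplaceStr $ReplaceStr }` would run it over the folder. Note that WriteAllText writes UTF-8 without a byte order mark by default, so pass an explicit encoding if your files need something else.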