
using powershell to replace extended ascii character in a text file


In Windows PowerShell, the default character encoding when reading from / writing to[1] files is "ANSI", i.e., the legacy 8-bit code page implied by the active system locale.
(By contrast, PowerShell Core defaults to UTF-8.)

For instance, the code page associated with the system locale on a US-English system is 1252, i.e., Windows-1252, where code point 0x93 is the non-ASCII left double quotation mark (“).
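If you're unsure which "ANSI" code page is active on a given machine, you can inspect it interactively - a quick sketch, relying on the fact that in Windows PowerShell [Text.Encoding]::Default reflects the active "ANSI" code page:

# Windows PowerShell: [Text.Encoding]::Default is the active "ANSI" encoding.
[Text.Encoding]::Default.CodePage       # -> e.g. 1252
[Text.Encoding]::Default.EncodingName   # -> e.g. 'Western European (Windows)'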

However, once a text file's content has been read into memory, it is stored as .NET [string] instances, whose characters are represented as UTF-16LE code units.

As a Unicode character, “ has code point U+201C, which is expressed as 0x201c in UTF-16LE.
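You can verify this interactively - a minimal sketch; the “ literal assumes your script or console input is read with the correct encoding:

[int] [char] '“'                 # -> 8220, the decimal form of 0x201c
'0x{0:x}' -f [int] [char] '“'    # -> '0x201c'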

Therefore - because in memory all strings are UTF-16LE code units - what you need to replace is [char] 0x201c:

$q1 = [char] 0x201c  # “

Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $_.FullName
}

Note that Set-Content also uses the default character encoding, so the rewritten files will be "ANSI"-encoded as well - use the -Encoding parameter to change the output encoding, if desired.
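For instance, to write the rewritten files as UTF-8 instead - a sketch that reuses $q1 from the snippet above; note that in Windows PowerShell -Encoding UTF8 produces UTF-8 with a BOM:

Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' |
    Set-Content -Encoding UTF8 $_.FullName
}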

Also note the (...) around the Get-Content call, which ensures that the input file is read into memory in full, up front; this is what enables writing back to the same file in the same pipeline.
While this approach is convenient, note that it bears a slight risk of data loss if writing back to the input file is interrupted before completion.
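If you want to avoid that risk, one option - not part of the original command; the .tmp suffix is just an illustration - is to write to a temporary file first and only replace the original once the write has completed:

Get-ChildItem *.csv -Recurse | ForEach-Object {
  $tmpFile = "$($_.FullName).tmp"         # hypothetical temp. path next to the original
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $tmpFile
  Move-Item -Force $tmpFile $_.FullName   # replace the original only after the write completed
}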


Converting an "ANSI" code point to a Unicode code point

The following shows how an "ANSI" (8-bit) code point such as 0x93 can be converted to its equivalent UTF-16 code point, 0x201c:

# Convert an array of "ANSI" code points (1 byte each) to the UTF-16
# string they represent.
# Note: In Windows PowerShell, [Text.Encoding]::Default contains
#       the "ANSI" encoding set by the system locale.
$str = [Text.Encoding]::Default.GetString([byte[]] 0x93) # -> '“'

# Get the UTF-16 code points of the characters making up the string.
$codePoints = [int[]] [char[]] $str

# Format the first and only code point as a hex. number.
'0x{0:x}' -f $codePoints[0]  # -> '0x201c'

[1] Writing files with Set-Content, that is; using Out-File / >, by contrast, creates UTF-16LE ("Unicode") files. The cmdlets in Windows PowerShell display a bewildering array of differing encodings: see this answer. Fortunately, PowerShell Core now consistently defaults to (BOM-less) UTF-8.