Powershell remove HTML tags in string content


For a pure regex, it should be as easy as <[^>]+>:

$string -replace '<[^>]+>',''

Note that this could fail with certain HTML comments or the contents of <pre> tags.
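The failure cases mentioned above can be reproduced with a few sample strings. This is a minimal sketch; the sample inputs and variable names are made up for illustration and do not come from the original question.

```powershell
$pattern = '<[^>]+>'

# Ordinary markup: every tag is removed.
$simple = '<p>Hello <b>world</b></p>' -replace $pattern, ''
# $simple is 'Hello world'

# A comment containing '>': the match stops at the first '>',
# so the rest of the comment leaks into the output.
$comment = '<!-- a > b -->text' -replace $pattern, ''
# $comment is ' b -->text'

# A literal '<' inside <pre>: everything from that '<' up to the
# next '>' is treated as a tag and swallowed.
$pre = '<pre>if (a < b) {}</pre>' -replace $pattern, ''
# $pre is 'if (a '
```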

Instead, you could use the HTML Agility Pack, which is designed for use in .NET code; I've used it successfully in PowerShell before:

Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText

HTML Agility Pack works well with non-perfect HTML.


You can try this:

$string -replace '<.*?>',''
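One difference between this lazy pattern and the negated character class `<[^>]+>` is worth knowing: in .NET regular expressions `.` does not match a newline by default, so a tag that spans several lines is left in place unless the single-line option `(?s)` is added. A small sketch, with a made-up input string:

```powershell
# A tag whose attributes continue on the next line (`n is a newline).
$html = "<a`nhref='x'>link</a>"

# '.' stops at the newline, so only </a> is removed.
$lazy = $html -replace '<.*?>', ''

# (?s) lets '.' match newlines as well.
$lazySingleLine = $html -replace '(?s)<.*?>', ''

# [^>] already matches newlines, so no option is needed.
$negated = $html -replace '<[^>]+>', ''
```

Here `$lazy` still contains the opening `<a ...>` tag, while `$lazySingleLine` and `$negated` are both `link`.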


To resolve umlauts and special characters, I used an HTMLFile COM object. Here is my function:

Function ConvertFrom-Html
{
    <#
        .SYNOPSIS
            Converts an HTML string to plain text.
        .DESCRIPTION
            Creates an HTMLFile COM object and uses innerText to get the plain text.
            If that fails, it replaces several HTML entity placeholders and removes all <> tags via regex.
        .INPUTS
            String. HTML as a string.
        .OUTPUTS
            String. The HTML text as plain text.
        .EXAMPLE
            $html = "<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
            ConvertFrom-Html -Html $html
            $html | ConvertFrom-Html

            Result:
            "Nutzen:
            Der Nutzen ist überaus groß.
            Test ob 3 < als 5 ist & "4" > '2'?"
        .NOTES
            Author: Ludwig Fichtinger FILU
            Initial Creation Date: 01.06.2021
            ChangeLog: v2 20.08.2021 try catch with replace for systems without Internet Explorer
    #>
    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )

    try
    {
        # Let the HTMLFile COM object parse the markup and decode the entities.
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch
    {
        # Fallback for systems without Internet Explorer: manual replacements.
        $nl = [System.Environment]::NewLine
        $PlainText = $Html -replace '<br>',$nl
        $PlainText = $PlainText -replace '<br/>',$nl
        $PlainText = $PlainText -replace '<br />',$nl
        $PlainText = $PlainText -replace '</p>',$nl
        $PlainText = $PlainText -replace '&nbsp;',' '
        # -creplace is case-sensitive; plain -replace would turn &auml; into 'Ä'.
        $PlainText = $PlainText -creplace '&Auml;','Ä'
        $PlainText = $PlainText -creplace '&auml;','ä'
        $PlainText = $PlainText -creplace '&Ouml;','Ö'
        $PlainText = $PlainText -creplace '&ouml;','ö'
        $PlainText = $PlainText -creplace '&Uuml;','Ü'
        $PlainText = $PlainText -creplace '&uuml;','ü'
        $PlainText = $PlainText -replace '&szlig;','ß'
        $PlainText = $PlainText -replace '&quot;','"'
        $PlainText = $PlainText -replace '&apos;',"'"
        # Strip the tags before decoding &gt; and &lt;, so decoded text is not mistaken for markup.
        $PlainText = $PlainText -replace '<.*?>',''
        $PlainText = $PlainText -replace '&gt;','>'
        $PlainText = $PlainText -replace '&lt;','<'
        # Decode &amp; last so that input such as &amp;lt; is not decoded twice.
        $PlainText = $PlainText -replace '&amp;','&'
    }

    return $PlainText
}

Example:

"<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html

Result:

Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
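
As a possible alternative to a hand-written entity table, .NET's `System.Net.WebUtility.HtmlDecode` decodes named and numeric character references in one call. The following is only a sketch of that swapped-in technique, not the function above: `Remove-HtmlMarkup` is a hypothetical name, and the choice to treat `<br>` and `</p>` as line breaks is an assumption carried over from the fallback branch.

```powershell
# Hypothetical helper, not part of the original answer.
function Remove-HtmlMarkup
{
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyString()]
        [string]$Html
    )

    $nl = [System.Environment]::NewLine

    # Turn line-breaking tags (<br>, <br/>, <br />, </p>) into newlines.
    $text = $Html -replace '<br\s*/?>|</p>', $nl

    # Remove all remaining tags.
    $text = $text -replace '<[^>]+>', ''

    # Decode entities last, so a decoded '<' or '>' is never treated as markup.
    [System.Net.WebUtility]::HtmlDecode($text)
}

$plain = Remove-HtmlMarkup -Html '<p>Gr&uuml;&szlig;e</p><p>3 &lt; 5 &amp; &quot;x&quot;<br />done</p>'
```

Like the regex answers, this sketch does not handle comments or `<pre>` content containing `<` or `>`; it only replaces the entity table.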