PowerShell: remove HTML tags in string content
For a pure regex, it should be as easy as <[^>]+>:
$string -replace '<[^>]+>',''
Note that this can fail on HTML comments that contain a > character and on the contents of <pre> tags.
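To make that caveat concrete, here is a small sketch (the sample strings are made up for illustration) that shows the pattern working on simple markup, leaving debris behind when a comment contains a `>`, and one possible workaround of stripping comments first:

```powershell
# Simple markup: every tag is matched and removed.
$simple = '<p>Hello <b>world</b></p>'
$simpleResult = $simple -replace '<[^>]+>', ''
# $simpleResult is 'Hello world'

# A comment containing '>' ends the match early, so part of the comment survives.
$tricky = '<p>Hello <!-- a > b --> world</p>'
$trickyResult = $tricky -replace '<[^>]+>', ''
# $trickyResult is 'Hello  b --> world'

# Possible workaround: remove comments first, then the remaining tags.
# (?s) lets '.' match line breaks, so multi-line comments are covered too.
$withoutComments = $tricky -replace '(?s)<!--.*?-->', ''
$fixedResult = $withoutComments -replace '<[^>]+>', ''
# $fixedResult is 'Hello  world' (the two spaces that surrounded the comment remain)
```

This is still only a regex approximation; for markup you do not control, a real parser is the safer choice.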
Instead, you could use the HTML Agility Pack, which is designed for use in .NET code. I've used it successfully in PowerShell before:
Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($string)
$doc.DocumentNode.InnerText
HTML Agility Pack works well with non-perfect HTML.
To resolve umlauts and other special characters, I used an HTMLFile COM object. Here is my function:
Function ConvertFrom-Html {
    <#
        .SYNOPSIS
            Converts an HTML string to plain text.
        .DESCRIPTION
            Creates an HTMLFile COM object and uses innerText to get the plain text.
            If that raises an error, it replaces several HTML entity placeholders
            and removes all <> tags via regex.
        .INPUTS
            String. HTML as a string.
        .OUTPUTS
            String. The HTML text as plain text.
        .EXAMPLE
            $html = "<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?"
            ConvertFrom-Html -Html $html
            $html | ConvertFrom-Html
            Result:
            "Nutzen:
            Der Nutzen ist überaus groß.
            Test ob 3 < als 5 ist & "4" > '2'?"
        .NOTES
            Author: Ludwig Fichtinger FILU
            Initial Creation Date: 01.06.2021
            ChangeLog: v2 20.08.2021 try catch with replace for systems without Internet Explorer
    #>
    [CmdletBinding(SupportsShouldProcess = $True)]
    Param(
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true, HelpMessage = "HTML as a string")]
        [AllowEmptyString()]
        [string]$Html
    )
    try {
        # Preferred path: let the HTMLFile COM object parse the markup.
        $HtmlObject = New-Object -Com "HTMLFile"
        $HtmlObject.IHTMLDocument2_write($Html)
        $PlainText = $HtmlObject.documentElement.innerText
    }
    catch {
        # Fallback for systems without Internet Explorer: manual replacements.
        $nl = [System.Environment]::NewLine
        $PlainText = $Html -replace '<br>', $nl
        $PlainText = $PlainText -replace '<br/>', $nl
        $PlainText = $PlainText -replace '<br />', $nl
        $PlainText = $PlainText -replace '</p>', $nl
        $PlainText = $PlainText -replace '&nbsp;', ' '
        $PlainText = $PlainText -creplace '&Auml;', 'Ä'
        $PlainText = $PlainText -creplace '&auml;', 'ä'
        $PlainText = $PlainText -creplace '&Ouml;', 'Ö'
        $PlainText = $PlainText -creplace '&ouml;', 'ö'
        $PlainText = $PlainText -creplace '&Uuml;', 'Ü'
        $PlainText = $PlainText -creplace '&uuml;', 'ü'
        $PlainText = $PlainText -replace '&szlig;', 'ß'
        $PlainText = $PlainText -replace '&quot;', '"'
        $PlainText = $PlainText -replace '&apos;', "'"
        # Strip the tags before decoding &gt; and &lt;, so decoded text is not mistaken for a tag.
        $PlainText = $PlainText -replace '<.*?>', ''
        $PlainText = $PlainText -replace '&gt;', '>'
        $PlainText = $PlainText -replace '&lt;', '<'
        # Decode &amp; last so that text such as &amp;lt; is not decoded twice.
        $PlainText = $PlainText -replace '&amp;', '&'
    }
    return $PlainText
}
Example:
"<p><strong>Nutzen:</strong></p><p>Der Nutzen ist &uuml;beraus gro&szlig;.<br />Test ob 3 &lt; als 5 ist &amp; &quot;4&quot; &gt; &apos;2&apos;?" | ConvertFrom-Html
Result:
Nutzen:
Der Nutzen ist überaus groß.
Test ob 3 < als 5 ist & "4" > '2'?
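On systems where the HTMLFile COM object is unavailable, the chain of entity replacements in the catch block could be swapped for .NET's built-in decoder. The following is only a minimal sketch, assuming [System.Net.WebUtility] is available (.NET Framework 4.0 or later, or PowerShell 7). The function name ConvertFrom-HtmlFallback and the sample input are hypothetical and not part of the original answer:

```powershell
# Hypothetical helper: a regex-based fallback that relies on WebUtility for entities.
function ConvertFrom-HtmlFallback {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyString()]
        [string]$Html
    )

    $nl = [System.Environment]::NewLine
    # Turn <br>, <br/>, <br /> and </p> into line breaks before stripping tags.
    $text = $Html -replace '<br\s*/?>|</p>', $nl
    # Remove all remaining tags (same caveats as any regex-based approach).
    $text = $text -replace '<[^>]+>', ''
    # Decode named and numeric entities in a single pass, after the tags are gone.
    [System.Net.WebUtility]::HtmlDecode($text)
}

$result = ConvertFrom-HtmlFallback -Html '<p>gro&szlig; &amp; &quot;4&quot; &gt; 3</p>'
# $result holds: groß & "4" > 3, followed by a line break
```

Decoding after the tags have been removed keeps a decoded `<` or `>` from being mistaken for markup, and a single decoding pass avoids turning `&amp;lt;` into `<`.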