How do you convert Html to plain text? asp.net

How do you convert Html to plain text?


The MIT-licensed HtmlAgilityPack includes, in one of its samples, a method that converts HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello, world!


I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

// Requires: using System; using System.Net; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    // Matches one or more whitespace characters (including line breaks) between '>' and '<'
    const string tagWhiteSpace = @">\s+<";
    // Matches any characters between '<' and '>', even when the closing '>' is missing
    const string stripFormatting = @"<[^>]*(>|$)";
    // Matches <br>, <br/>, <br />, <BR>, <BR/>, <BR />
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";

    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Remove whitespace and line breaks between tags
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip the remaining tags
    text = stripFormattingRegex.Replace(text, string.Empty);
    // Decode HTML entities last, so an encoded "&lt;" is not mistaken for the start of a tag
    text = WebUtility.HtmlDecode(text);
    return text;
}


If you are talking about tag stripping, it is relatively straightforward as long as you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>
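
As a minimal sketch of how that pattern might be applied in C#, assuming nothing beyond System.Text.RegularExpressions (the class and method names TagStripper and StripTags are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches everything from a '<' up to and including the next '>'.
    private static readonly Regex TagRegex = new Regex("<[^>]*>");

    // Removes every tag and leaves the text between tags untouched.
    // Entities such as &amp; are not decoded here.
    public static string StripTags(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }
}
```

Calling TagStripper.StripTags("<b>hello, <i>world!</i></b>") returns "hello, world!". The contents of <script> or <style> elements would be kept as text, which is the limitation described next.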

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.
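
The non-greedy approach could look roughly like the sketch below: remove whole <script> and <style> elements first, then strip the remaining tags. The names ScriptAwareStripper and StripTags are hypothetical, and this is a best-effort approximation rather than a parser. For example, a script whose string literal contains "</script>" would end the match too early.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptAwareStripper
{
    // Non-greedy match of a whole <script> or <style> element, content included.
    // Singleline lets '.' span line breaks; \1 pairs the closing tag with the opening one.
    private static readonly Regex BlockRegex = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag.
    private static readonly Regex TagRegex = new Regex("<[^>]*>");

    public static string StripTags(string html)
    {
        // Drop script/style elements together with their content first,
        // so code inside them never reaches the output.
        var withoutBlocks = BlockRegex.Replace(html, string.Empty);
        return TagRegex.Replace(withoutBlocks, string.Empty);
    }
}
```

With the input "<p>Hi</p><script>if (a < b) { alert('x'); }</script> there", this returns "Hi there", whereas the plain <[^>]*> pattern alone would leave fragments of the script in the text.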

If you can use regular expressions, there are many web pages out there with good info.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool. Unfortunately, I don't know of a good one to recommend.