
How to read HTML as XML?


HTML simply isn’t the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can convert the result to LINQ to XML, or process it directly.
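
To illustrate the difference, here is a minimal sketch that uses only the framework's own XML classes. The class and method names (`XhtmlCheck`, `TryParseAsXml`) are made up for this example. It assumes the markup is already in a string, and it shows that well-formed XHTML loads directly while ordinary tag-soup HTML is rejected:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlCheck
{
    // Returns true when the markup is well-formed XML and could be loaded as-is.
    // Hypothetical helper for illustration; not part of any library.
    public static bool TryParseAsXml(string markup, out XDocument document)
    {
        try
        {
            document = XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical for real-world HTML: unclosed tags, unquoted attributes, etc.
            document = null;
            return false;
        }
    }
}
```

Calling `XhtmlCheck.TryParseAsXml("<html><body><p>Hi<br/></p></body></html>", out doc)` should succeed, whereas the same markup with an unclosed `<br>` should fail, which is the case where an HTML parser is needed.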


I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from its home page:

    using System.IO;
    using System.Xml;

    XmlDocument FromHtml(TextReader reader)
    {
        // Set up SgmlReader to parse the input as HTML
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        // Create the document and load it from the reader
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }
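
If you would rather query the result with LINQ to XML, one possible approach is to copy the `XmlDocument` into an `XDocument` through an `XmlNodeReader`. This is only a sketch: the class and method names (`XmlDocumentToLinq`, `ToXDocument`, `GetLinkTargets`) are invented for the example, and it assumes the elements carry no XML namespace (if the page declares one, the `"a"` name would need that namespace added).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlDocumentToLinq
{
    // Copies an XmlDocument into an XDocument by replaying its nodes
    // through an XmlNodeReader.
    public static XDocument ToXDocument(XmlDocument source)
    {
        using (var nodeReader = new XmlNodeReader(source))
        {
            // Skip any XML declaration or DOCTYPE and start at the root element.
            nodeReader.MoveToContent();
            return XDocument.Load(nodeReader);
        }
    }

    // Collects the href value of every <a> element that has one.
    // Assumes elements are in no namespace.
    public static List<string> GetLinkTargets(XDocument document)
    {
        return document
            .Descendants("a")
            .Select(a => (string)a.Attribute("href"))
            .Where(href => href != null)
            .ToList();
    }
}
```

With the method above, something like `XmlDocumentToLinq.GetLinkTargets(XmlDocumentToLinq.ToXDocument(FromHtml(reader)))` should give you the link targets as plain strings.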


If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.

This code gets a page from the web and extracts all links:

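    // Needs: using System.Linq; using HtmlAgilityPack;
    HtmlWeb web = new HtmlWeb();
    HtmlDocument document = web.Load("http://www.stackoverflow.com");
    // SelectNodes returns null when nothing matches
    HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
    HtmlNode[] links = anchors == null ? new HtmlNode[0] : anchors.ToArray();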

Open an HTML file from disk and get the URL of a specific link:

    HtmlDocument document2 = new HtmlDocument();
    document2.Load(@"C:\Temp\page.html");
    // SelectSingleNode returns null if no element has id "myLink"
    HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
    Console.WriteLine(link.Attributes["href"].Value);