
How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?


It may not be perfect (emphasis added, since people keep missing this disclaimer), but what I've done in that case is below. You can adjust it to work with a stream.

/// <summary>
/// Removes control characters and other non-UTF-8 characters
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string with no control characters or entities above 0x00FD</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder();
    char ch;

    for (int i = 0; i < inString.Length; i++)
    {
        ch = inString[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        //if using .NET version prior to 4, use above logic
        if (XmlConvert.IsXmlChar(ch)) // this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}


I like Eugene's whitelist concept. I needed to do something similar to the original poster, but I needed to support all Unicode characters, not just those up to 0x00FD. The XML spec is:

Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In .NET, the internal representation of Unicode characters is only 16 bits (strings are UTF-16), so we can't "allow" 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However, it is possible that if we allowed these surrogate code points in our whitelist, UTF-8 encoding our string might produce valid XML in the end, as long as proper UTF-8 encoding was produced from the surrogate pairs of UTF-16 characters in the .NET string. I haven't explored this, though, so I went with the safer bet and didn't allow the surrogates in my whitelist.

The comments in Eugene's solution are misleading, though: the problem is that the characters we are excluding are not valid in XML, yet they are perfectly valid Unicode code points. We are not removing "non-UTF-8 characters"; we are removing UTF-8 characters that may not appear in well-formed XML documents.

public static string XmlCharacterWhitelist( string in_string ) {
    if( in_string == null ) return null;

    StringBuilder sbOutput = new StringBuilder();
    char ch;

    for( int i = 0; i < in_string.Length; i++ ) {
        ch = in_string[i];
        if( ( ch >= 0x0020 && ch <= 0xD7FF ) ||
            ( ch >= 0xE000 && ch <= 0xFFFD ) ||
            ch == 0x0009 ||
            ch == 0x000A ||
            ch == 0x000D ) {
            sbOutput.Append( ch );
        }
    }
    return sbOutput.ToString();
}
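If you do want to keep supplementary-plane characters (0x10000-0x10FFFF) instead of dropping them, one option is to pass well-formed surrogate pairs through and drop only unpaired surrogates. A minimal sketch, assuming .NET 4+ for XmlConvert.IsXmlSurrogatePair (the class and method name SanitizeWithSurrogates are my own, not from the answers above):

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlSanitizer {
    // Keeps every code point the XML 1.0 Char production allows,
    // including supplementary characters encoded as UTF-16 surrogate pairs.
    public static string SanitizeWithSurrogates(string input) {
        if (input == null) return null;

        StringBuilder sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++) {
            char ch = input[i];
            if (XmlConvert.IsXmlChar(ch)) {
                sb.Append(ch);
            } else if (char.IsHighSurrogate(ch)
                       && i + 1 < input.Length
                       && XmlConvert.IsXmlSurrogatePair(input[i + 1], ch)) {
                // Well-formed pair: keep both halves and skip the low surrogate.
                sb.Append(ch);
                sb.Append(input[i + 1]);
                i++;
            }
            // Unpaired surrogates and other invalid characters are dropped.
        }
        return sb.ToString();
    }
}
```

Note that XmlConvert.IsXmlSurrogatePair takes the low surrogate first and the high surrogate second.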


To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also present in Silverlight. Here is a small sample:

using System;
using System.Linq;
using System.Xml;

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
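To tie this back to the original question: once the text has been sanitized, you can wrap it in a StringReader and hand it to an XmlReader or XPathDocument. A rough sketch (the raw string and element names here are made up for illustration), assuming a character-filtering helper like the RemoveInvalidXmlChars above:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.XPath;

class Program {
    static string RemoveInvalidXmlChars(string text) {
        return new string(text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray());
    }

    static void Main() {
        // "\f" (form feed) is not a valid XML character, so parsing the
        // raw string directly would throw; sanitize it first.
        string raw = "<root><item>hello\fworld</item></root>";
        string clean = RemoveInvalidXmlChars(raw);

        using (StringReader sr = new StringReader(clean)) {
            XPathDocument doc = new XPathDocument(sr);
            XPathNavigator nav = doc.CreateNavigator();
            Console.WriteLine(nav.SelectSingleNode("/root/item").Value);
        }
    }
}
```

Sanitizing the whole document string like this works only when the invalid characters sit in text content; if they appear inside markup you have a different problem than character filtering can solve.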