Removing invalid XML characters from a string in Java

Java's regex engine supports supplementary characters, so you can specify the ranges above U+FFFF as surrogate pairs, that is, two UTF-16 chars per code point.

Here is the pattern for removing characters that are illegal in XML 1.0:

    // XML 1.0
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml10pattern = "[^"
                        + "\u0009\r\n"
                        + "\u0020-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

    // XML 1.1
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml11pattern = "[^"
                        + "\u0001-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]+";

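You will need to use String.replaceAll(...) and not String.replace(...), because only replaceAll treats its first argument as a regular expression; replace matches it literally.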

    String illegal = "Hello, World!\0";
    String legal = illegal.replaceAll(xml10pattern, "");
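As a minimal, self-contained sketch of the XML 1.0 pattern above in use, the class below precompiles the pattern once and strips a NUL, a backspace, and nothing else, leaving a supplementary character (an emoji written as a surrogate pair) intact. The class name `Xml10Sanitizer` and the method `strip` are hypothetical names chosen for illustration, not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
class Xml10Sanitizer {
    // Same character class as the XML 1.0 pattern above, compiled once.
    // The last range covers U+10000 to U+10FFFF, written as surrogate pairs.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Removes every character that the XML 1.0 Char production does not allow.
    static String strip(String text) {
        return INVALID_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and backspace are illegal in XML 1.0; the emoji is legal.
        String input = "Hello\u0000, World\u0008! \uD83D\uDE00";
        System.out.println(strip(input));
    }
}
```

Precompiling the pattern avoids recompiling the regex on every call, which String.replaceAll(...) would otherwise do.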

java xml regex invalid-characters

Should we consider surrogate pairs? A Java char is only 16 bits wide, so otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true.

In my tests, the regex approach also seems slower than the following loop.

    static String stripInvalidXmlChars(String text) {
        if (null == text || text.isEmpty()) {
            return text;
        }
        final int len = text.length();
        char current = 0;
        int codePoint = 0;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            current = text.charAt(i);
            boolean surrogate = false;
            // A high surrogate followed by a low surrogate forms one code point.
            if (Character.isHighSurrogate(current)
                    && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
                surrogate = true;
                codePoint = text.codePointAt(i++);
            } else {
                codePoint = current;
            }
            // Keep only code points allowed by the XML 1.0 Char production.
            if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                    || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                    || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                    || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
                sb.append(current);
                if (surrogate) {
                    // i now points at the low surrogate of the pair.
                    sb.append(text.charAt(i));
                }
            }
        }
        return sb.toString();
    }
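The same per-code-point check can also be sketched with the code point stream that Java 8 and later provide on strings. This is an alternative formulation, not the loop above: `codePoints()` combines valid surrogate pairs into a single int and passes unpaired surrogates through unchanged, where the filter then rejects them. The class name `XmlCodePointFilter` and its method names are hypothetical, and no claim is made here about how its speed compares with the loop or the regex.

```java
// Hypothetical helper class; the names are assumptions for this example.
class XmlCodePointFilter {
    // XML 1.0 Char production, checked per code point rather than per char.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String strip(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // Unpaired surrogates arrive as values in 0xD800..0xDFFF,
        // which none of the ranges above accept, so they are dropped.
        return text.codePoints()
                .filter(XmlCodePointFilter::isValidXml10)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }
}
```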


All the answers so far only remove the invalid characters themselves. But an XML document can also contain numeric character references that expand to invalid characters, and those cause errors too. For example, if you have a reference such as &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple Java program that removes those invalid entity sequences.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class XmlCleaner {

      public static final Pattern XML_ENTITY_PATTERN =
          Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

      /**
       * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
       */
      String getCleanedXml(String xmlString) {
        Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
        Set<String> replaceSet = new HashSet<>();
        while (m.find()) {
          String group = m.group(1);
          int val;
          if (group != null) {
            // Hexadecimal reference, e.g. &#x2;
            val = Integer.parseInt(group, 16);
            if (isInvalidXmlChar(val)) {
              replaceSet.add("&#x" + group + ";");
            }
          } else if ((group = m.group(2)) != null) {
            // Decimal reference, e.g. &#2;
            val = Integer.parseInt(group);
            if (isInvalidXmlChar(val)) {
              replaceSet.add("&#" + group + ";");
            }
          }
        }
        String cleanedXmlString = xmlString;
        for (String replacer : replaceSet) {
          // The reference is literal text, so replace() is enough here.
          cleanedXmlString = cleanedXmlString.replace(replacer, "");
        }
        return cleanedXmlString;
      }

      private boolean isInvalidXmlChar(int val) {
        if (val == 0x9 || val == 0xA || val == 0xD ||
                val >= 0x20 && val <= 0xD7FF ||
                val >= 0xE000 && val <= 0xFFFD ||
                val >= 0x10000 && val <= 0x10FFFF) {
          return false;
        }
        return true;
      }
    }
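The same idea can be sketched as a single pass, assuming Java 9 or later, where `Matcher.replaceAll` accepts a function that decides per match whether to keep or drop the reference. This is a variation on the program above, not a replacement for it: the class name `XmlEntityCleaner` and the method `clean` are hypothetical, the validity check uses the full XML 1.0 Char ranges, and treating a reference too large to fit in an int as invalid is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are assumptions for this example.
class XmlEntityCleaner {
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops numeric character references that expand to an illegal character.
    static String clean(String xml) {
        return CHAR_REF.matcher(xml).replaceAll(m -> {
            String hex = m.group(1);
            int cp;
            try {
                cp = hex != null ? Integer.parseInt(hex, 16)
                                 : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                // Assumption: a value that overflows int cannot be a valid code point.
                return "";
            }
            // quoteReplacement keeps the matched text from being read as a replacement template.
            return isValidXml10(cp) ? Matcher.quoteReplacement(m.group()) : "";
        });
    }
}
```

A single pass avoids rescanning the whole document once per distinct invalid reference.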
