
removing invalid XML characters from a string in java


Java's regex engine supports supplementary characters (code points above U+FFFF), so you can specify those high ranges as UTF-16 surrogate pairs, that is, two chars per code point.
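To make the surrogate-pair point concrete, here is a minimal sketch. The class and method names are hypothetical and chosen only for illustration. It shows that a character class written with two `\u` escapes per endpoint is read by `java.util.regex.Pattern` as a range of whole code points.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class SupplementaryRangeDemo {
    // Character class covering U+10000..U+10FFFF, written as surrogate pairs.
    static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\ud800\udc00-\udbff\udfff]");

    /** True if the whole string is exactly one supplementary code point. */
    static boolean isSingleSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600, stored as two chars
        System.out.println(emoji.length());                             // 2
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));  // 1f600
        System.out.println(isSingleSupplementary(emoji));               // true
        System.out.println(isSingleSupplementary("A"));                 // false
    }
}
```

A lone high surrogate such as `"\uD83D"` does not match, because on its own it is the code point U+D83D, which lies outside the range.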

Here is the pattern for removing characters that are illegal in XML 1.0:

    // XML 1.0
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml10pattern = "[^"
                        + "\u0009\r\n"
                        + "\u0020-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

    // XML 1.1
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    String xml11pattern = "[^"
                        + "\u0001-\uD7FF"
                        + "\uE000-\uFFFD"
                        + "\ud800\udc00-\udbff\udfff"
                        + "]+";

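You will need to use String.replaceAll(...) and not String.replace(...): replaceAll interprets its first argument as a regular expression, whereas replace treats it as a literal string.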

    String illegal = "Hello, World!\0";
    String legal = illegal.replaceAll(xml10pattern, "");
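If the replacement runs often, one option is to compile the XML 1.0 pattern once and reuse it. The following is a sketch under that assumption; the class name `Xml10RegexSanitizer` and the method `strip` are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class Xml10RegexSanitizer {
    // Matches runs of characters outside the XML 1.0 Char production.
    // Compiled once so repeated calls do not re-parse the pattern.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]+");

    /** Returns text with every character that is illegal in XML 1.0 removed. */
    static String strip(String text) {
        return ILLEGAL_XML10.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Hello, World!\0")); // Hello, World!
    }
}
```

Valid surrogate pairs are kept intact, while an unpaired surrogate is removed because it falls outside every allowed range.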


Should we consider surrogate pairs? A Java char is only 16 bits wide, so unless pairs are combined into code points, a check such as '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true.

In my tests, the regex approach was also slower than the following loop.

    // The method name is illustrative; the original snippet showed only the body.
    static String stripInvalidXmlChars(String text) {
        if (null == text || text.isEmpty()) {
            return text;
        }
        final int len = text.length();
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            char current = text.charAt(i);
            boolean surrogate = false;
            int codePoint;
            // Combine a valid high/low surrogate pair into a single code point.
            if (Character.isHighSurrogate(current)
                    && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
                surrogate = true;
                codePoint = text.codePointAt(i++); // i now points at the low surrogate
            } else {
                codePoint = current; // BMP char or unpaired surrogate
            }
            if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                    || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                    || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                    || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
                sb.append(current);
                if (surrogate) {
                    sb.append(text.charAt(i)); // append the low surrogate as well
                }
            }
        }
        return sb.toString();
    }
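The same filter can also be written over code points directly, which avoids the manual surrogate bookkeeping. This is a sketch, not a measured alternative: it uses `CharSequence.codePoints()` (Java 8+), the class and method names are hypothetical, and it likely trades some speed for brevity compared with the hand-written loop.

```java
// Hypothetical helper; the names are illustrative only.
final class Xml10CodePointFilter {
    /** True if cp is allowed by the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns text without code points that are illegal in XML 1.0. */
    static String strip(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // codePoints() merges valid surrogate pairs and passes unpaired
        // surrogates through as values in 0xD800..0xDFFF, which the filter drops.
        return text.codePoints()
                .filter(Xml10CodePointFilter::isLegalXml10)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }
}
```

For example, `Xml10CodePointFilter.strip("a\u0000b")` yields `"ab"`, and a string containing only legal characters comes back unchanged.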


All the answers so far only remove the characters themselves. But sometimes an XML document contains character references (entity sequences) that expand to invalid characters, and those cause errors too. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....
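The failure is easy to reproduce with the JDK's built-in DOM parser. The sketch below uses a hypothetical class `IllegalEntityDemo` that only reports whether a document parses; the exact exception message depends on the parser in use, so it is not checked here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
final class IllegalEntityDemo {
    /** Returns true if the default DOM parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e); // environment problem, not bad XML
        }
    }
}
```

With this helper, `parses("<a>&#9;</a>")` is accepted, while `parses("<a>&#2;</a>")` is rejected, even though the raw string contains only ordinary ASCII characters.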

Here is a simple Java program that removes those invalid entity sequences.

  // Requires: java.util.HashSet, java.util.Set,
  //           java.util.regex.Matcher, java.util.regex.Pattern
  public static final Pattern XML_ENTITY_PATTERN =
      Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

  /**
   * Remove problematic XML entities from the XML string so that you can
   * parse it with the Java DOM / SAX libraries.
   */
  static String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String hex = m.group(1);                          // set for &#x...; references
      String digits = (hex != null) ? hex : m.group(2); // otherwise decimal
      try {
        int val = Integer.parseInt(digits, (hex != null) ? 16 : 10);
        if (isInvalidXmlChar(val)) {
          replaceSet.add(m.group());
        }
      } catch (NumberFormatException e) {
        // Too large for an int, so it cannot be a legal code point.
        replaceSet.add(m.group());
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      // replace() matches literally; replaceAll() would treat it as a regex.
      cleanedXmlString = cleanedXmlString.replace(replacer, "");
    }
    return cleanedXmlString;
  }

  private static boolean isInvalidXmlChar(int val) {
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0xE000 && val <= 0xFFFD ||
            val >= 0x10000 && val <= 0x10FFFF) {
      return false;
    }
    return true;
  }
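As a variation on the same idea, the references can be removed in a single pass with `Matcher.appendReplacement` and `Matcher.appendTail`, instead of collecting them in a set and replacing them one by one. This is a sketch of that swapped-in technique; the class `XmlCharRefCleaner` and its methods are hypothetical names, and it assumes the XML 1.0 character ranges.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class XmlCharRefCleaner {
    // Group 1: hex digits of &#x...;  Group 2: decimal digits of &#...;
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    /** True if cp is allowed by the XML 1.0 Char production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops character references that expand to illegal XML 1.0 characters. */
    static String clean(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean legal;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                legal = isLegalXml10(cp);
            } catch (NumberFormatException tooLarge) {
                legal = false; // does not fit in an int, so not a code point
            }
            // Keep legal references verbatim; replace illegal ones with nothing.
            m.appendReplacement(out, legal ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For instance, `XmlCharRefCleaner.clean("<a>&#2;x&#x41;</a>")` returns `"<a>x&#x41;</a>"`: the reference to 0x2 is dropped and the legal reference to `A` is left as written.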