How can I replace non-printable Unicode characters in Java?


myString.replaceAll("\\p{C}", "?");

See more about Unicode regex. Both java.util.regex.Pattern and String.replaceAll support these Unicode category classes.
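As a minimal sketch of the call above in a runnable form: the class name ReplaceOtherCategoryDemo, the method name replaceOther, and the sample string are made up for illustration and are not part of the original answer.

```java
final class ReplaceOtherCategoryDemo {

    // Replaces every match of Unicode general category C ("Other") with '?'.
    static String replaceOther(String input) {
        return input.replaceAll("\\p{C}", "?");
    }

    public static void main(String[] args) {
        // U+0007 (BEL) is a control character (Cc);
        // U+200B (ZERO WIDTH SPACE) is a format character (Cf).
        String sample = "a\u0007b\u200Bc";
        System.out.println(replaceOther(sample)); // prints a?b?c
    }
}
```

Note that replaceAll returns a new string; myString itself is not modified, so the result has to be assigned or used.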


Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP code points, it's more complicated. \p{C} includes the surrogate category \p{Cs}, so the replacement above can corrupt non-BMP code points by sometimes replacing only half of a surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters that are not part of a pair (each surrogate character has an assigned code point) will not be replaced. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
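For reference, here is the same loop wrapped in a method with a few sample inputs. This is only a sketch: the names CodePointReplacer and replaceNonPrintable and the sample strings are hypothetical, and appendCodePoint is used in place of Character.toChars purely as a convenience.

```java
final class CodePointReplacer {

    // Walks the string one code point at a time and replaces every
    // code point in general category C (Cc, Cf, Co, Cs, Cn) with '?'.
    static String replaceNonPrintable(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int offset = 0; offset < input.length(); ) {
            int codePoint = input.codePointAt(offset);
            offset += Character.charCount(codePoint);

            switch (Character.getType(codePoint)) {
                case Character.CONTROL:     // \p{Cc}
                case Character.FORMAT:      // \p{Cf}
                case Character.PRIVATE_USE: // \p{Co}
                case Character.SURROGATE:   // \p{Cs}
                case Character.UNASSIGNED:  // \p{Cn}
                    out.append('?');
                    break;
                default:
                    out.appendCodePoint(codePoint);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A valid surrogate pair (U+1F600) is read as one code point and kept intact.
        String pair = "x\uD83D\uDE00y";
        System.out.println(replaceNonPrintable(pair).equals(pair)); // prints true

        // A lone high surrogate is read as a single Cs code point and replaced.
        System.out.println(replaceNonPrintable("x\uD800y")); // prints x?y

        // U+0007 (BEL) is Cc and U+E000 is a private-use code point (Co).
        System.out.println(replaceNonPrintable("a\u0007b\uE000")); // prints a?b?
    }
}
```

Because codePointAt combines a valid high/low pair into one supplementary code point, the loop never sees half of a well-formed pair; only unpaired surrogates reach the Character.SURROGATE case.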


You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
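A small sketch of the difference between the two classes, assuming the goal is to replace matches with "?"; the class name ControlAndFormatDemo, both method names, and the sample strings are invented for this example.

```java
final class ControlAndFormatDemo {

    // Replaces only "Other, Control" (Cc) characters.
    static String replaceControl(String input) {
        return input.replaceAll("\\p{Cc}", "?");
    }

    // Replaces both "Other, Control" (Cc) and "Other, Format" (Cf) characters.
    static String replaceControlAndFormat(String input) {
        return input.replaceAll("[\\p{Cc}\\p{Cf}]", "?");
    }

    public static void main(String[] args) {
        // U+0007 (BEL) is Cc; U+200D (ZERO WIDTH JOINER) is Cf.
        String sample = "a\u0007b\u200Dc";

        // Only the control character is replaced; the joiner is still
        // in the result, although it is invisible when printed.
        System.out.println(replaceControl(sample).equals("a?b\u200Dc")); // prints true

        System.out.println(replaceControlAndFormat(sample)); // prints a?b?c

        // Tab and newline are also Cc, so they are replaced as well.
        System.out.println(replaceControl("a\tb")); // prints a?b
    }
}
```

The zero width joiner is one case where replacing Cf changes what the reader sees, since it is used to join emoji sequences. If whitespace such as tab or newline should survive, it has to be excluded from the pattern explicitly.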