Fastest way to strip all non-printable characters from a Java String Fastest way to strip all non-printable characters from a Java String java java

Fastest way to strip all non-printable characters from a Java String


using 1 char array could work a bit better

int length = s.length();char[] oldChars = new char[length];s.getChars(0, length, oldChars, 0);int newLen = 0;for (int j = 0; j < length; j++) {    char ch = oldChars[j];    if (ch >= ' ') {        oldChars[newLen] = ch;        newLen++;    }}s = new String(oldChars, 0, newLen);

and I avoided repeated calls to s.length();

another micro-optimization that might work is

int length = s.length();char[] oldChars = new char[length+1];s.getChars(0, length, oldChars, 0);oldChars[length]='\0';//avoiding explicit bound check in whileint newLen=-1;while(oldChars[++newLen]>=' ');//find first non-printable,                       // if there are none it ends on the null char I appendedfor (int  j = newLen; j < length; j++) {    char ch = oldChars[j];    if (ch >= ' ') {        oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j        newLen++;    }}s = new String(oldChars, 0, newLen);


If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];String stripControlChars(String s){    final int inputLen = s.length();    if ( oldChars.length < inputLen )    {        oldChars = new char[inputLen];    }    s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.


Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.

As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.

Anyways here we go:

    // Has to be done only once - so cache those! Prohibitively expensive otherwise    private Field value;    private Field offset;    private Field count;    private Field hash;    {        try {            value = String.class.getDeclaredField("value");            value.setAccessible(true);            offset = String.class.getDeclaredField("offset");            offset.setAccessible(true);            count = String.class.getDeclaredField("count");            count.setAccessible(true);            hash = String.class.getDeclaredField("hash");            hash.setAccessible(true);                       }        catch (NoSuchFieldException e) {            throw new RuntimeException();        }    }    @Override    public String strip(final String old) {        final int length = old.length();        char[] chars = null;        int off = 0;        try {            chars = (char[]) value.get(old);            off = offset.getInt(old);        }        catch(IllegalArgumentException e) {            throw new RuntimeException(e);        }        catch(IllegalAccessException e) {            throw new RuntimeException(e);        }        int newLen = off;        for(int j = off; j < off + length; j++) {            final char ch = chars[j];            if (ch >= ' ') {                chars[newLen] = ch;                newLen++;            }        }        if (newLen - off != length) {            // We changed the internal state of the string, so at least            // be friendly enough to correct it.            try {                count.setInt(old, newLen - off);                // Have to recompute hash later on                hash.setInt(old, 0);            }            catch(IllegalArgumentException e) {                e.printStackTrace();            }            catch(IllegalAccessException e) {                e.printStackTrace();            }        }        // Well we have to return something        return old;    }

For my teststring that gets 3477148.18ops/s vs. 2616120.89ops/s for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.