
Java - Count exactly 60 characters from a string with a mixture of UTF-8 and non UTF-8 characters


As far as I understand, you want to limit the string's length so that its UTF-8 encoded representation does not exceed 60 bytes. You can do it this way:

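    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    String s = …;
    CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
    ByteBuffer bb = ByteBuffer.allocate(60); // note the limit
    CharBuffer cb = CharBuffer.wrap(s);
    CoderResult r = enc.encode(cb, bb, true);
    if (r.isOverflow()) {
        System.out.println(s + " is too long for "
                + bb.capacity() + " " + enc.charset() + " bytes");
        // cb's position is the number of chars that fit; flip() limits
        // the buffer to that prefix so toString() returns only it
        s = cb.flip().toString();
        System.out.println("truncated to " + s);
    }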
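For reuse, the encoder approach above can be wrapped in a method. This is only a sketch: the class name `Utf8EncoderTruncate` and the method name `truncateToUtf8Bytes` are made up for illustration, and it assumes that malformed input (for example an unpaired surrogate) should be rejected instead of silently passed through.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class Utf8EncoderTruncate {

    // Hypothetical helper: returns the longest prefix of s whose
    // UTF-8 encoding fits into maxBytes bytes.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer bb = ByteBuffer.allocate(maxBytes);
        CharBuffer cb = CharBuffer.wrap(s);
        CoderResult r = enc.encode(cb, bb, true);
        if (r.isOverflow()) {
            // The encoder stopped at the first character that did not fit;
            // cb.position() is the number of chars it consumed before that.
            return s.substring(0, cb.position());
        }
        if (r.isError()) {
            // Assumption: treat malformed input as a caller error.
            throw new IllegalArgumentException("cannot encode input as UTF-8: " + r);
        }
        return s; // everything fit
    }

    public static void main(String[] args) {
        // "\u00e9" is é (2 bytes in UTF-8), so only é fits into 2 bytes.
        System.out.println(truncateToUtf8Bytes("\u00e9e", 2));
    }
}
```

Because the encoder only consumes a character when all of its bytes fit, a supplementary character (4 bytes, two Java chars) is either kept whole or dropped whole.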


This is my quick hack: a function to truncate a string to a given number of bytes in UTF-8 encoding:

    public static String truncateUtf8(String original, int byteCount) {
        // Each char needs at most 3 bytes in UTF-8, so the string certainly fits.
        if (original.length() * 3 <= byteCount) {
            return original;
        }
        StringBuilder sb = new StringBuilder();
        int count = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            // Byte count after adding this char (surrogates count as 3 bytes each).
            int newCount;
            if (c <= 0x7f) newCount = count + 1;
            else if (c <= 0x7ff) newCount = count + 2;
            else newCount = count + 3;
            if (newCount > byteCount) {
                break;
            }
            count = newCount;
            sb.append(c);
        }
        return sb.toString();
    }

It does not work as expected for characters outside the BMP: each one is counted as 6 bytes (3 per surrogate) instead of 4, and a surrogate pair can be split at the cut. It may also break grapheme clusters. But for most simple tasks it should be OK.

    truncateUtf8("e", 1) => "e"
    truncateUtf8("ée", 1) => ""
    truncateUtf8("ée", 2) => "é"
    truncateUtf8("ée", 3) => "ée"
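If characters outside the BMP matter, one possible fix is to walk the string by code point instead of by char. The sketch below is a variant, not the function above: the names `Utf8CodePointTruncate` and `truncateUtf8ByCodePoint` are hypothetical. It counts supplementary characters as 4 bytes and never cuts a surrogate pair in half, but it can still break grapheme clusters.

```java
final class Utf8CodePointTruncate {

    // Hypothetical variant: truncates at code point boundaries so that
    // the UTF-8 encoding of the result is at most byteCount bytes.
    static String truncateUtf8ByCodePoint(String original, int byteCount) {
        int bytes = 0; // UTF-8 bytes used by the prefix kept so far
        int i = 0;     // index (in chars) of the next code point
        while (i < original.length()) {
            int cp = original.codePointAt(i);
            int size;
            if (cp <= 0x7f) size = 1;
            else if (cp <= 0x7ff) size = 2;
            else if (cp <= 0xffff) size = 3;
            else size = 4; // supplementary character (surrogate pair)
            if (bytes + size > byteCount) {
                break;
            }
            bytes += size;
            i += Character.charCount(cp); // 1 char, or 2 for a surrogate pair
        }
        return original.substring(0, i);
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is U+1F600, which needs 4 bytes in UTF-8.
        System.out.println(truncateUtf8ByCodePoint("a\uD83D\uDE00b", 5));
    }
}
```

With a limit of 4 the same input is cut down to "a", because the 4-byte character no longer fits after the first byte is used.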