Java - Count exactly 60 characters from a string with a mixture of UTF-8 and non UTF-8 characters
As far as I understand, you want to limit the String length so that its UTF-8 encoded representation does not exceed 60 bytes. You can do it this way:
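    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    String s = …;
    CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
    ByteBuffer bb = ByteBuffer.allocate(60); // note the limit
    CharBuffer cb = CharBuffer.wrap(s);
    CoderResult r = enc.encode(cb, bb, true);
    if (r.isOverflow()) {
        System.out.println(s + " is too long for "
            + bb.capacity() + " " + enc.charset() + " bytes");
        // flip() restricts the buffer to the chars that were actually encoded
        s = cb.flip().toString();
        System.out.println("truncated to " + s);
    }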
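The encoder approach above can be wrapped in a reusable method. The following is only a minimal sketch: the class name `Utf8EncoderTruncate` and the method name `truncate` are made up for illustration, and the choice to replace malformed input (such as a lone surrogate) instead of reporting it is an assumption, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8EncoderTruncate {
    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits into maxBytes bytes.
    static String truncate(String s, int maxBytes) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                // Assumption: malformed input (e.g. a lone surrogate) is
                // replaced rather than reported as an error.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.allocate(maxBytes); // the byte limit
        CharBuffer cb = CharBuffer.wrap(s);
        // Encoding stops when the next character no longer fits into bb.
        enc.encode(cb, bb, true);
        // cb.position() is the number of chars that were consumed.
        return s.substring(0, cb.position());
    }

    public static void main(String[] args) {
        // "\u00e9" is 'é' (2 bytes in UTF-8); "\uD83D\uDE00" is one emoji (4 bytes).
        System.out.println(truncate("\u00e9e", 2));
        System.out.println(truncate("a\uD83D\uDE00", 4));
    }
}
```

Because the encoder only ever consumes whole characters, a surrogate pair is either kept completely or dropped completely.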
This is my quick hack: a function to truncate a string to a given number of bytes in UTF-8 encoding:
    public static String truncateUtf8(String original, int byteCount) {
        // A char never needs more than 3 bytes, so short strings always fit.
        if (original.length() * 3 <= byteCount) {
            return original;
        }
        StringBuilder sb = new StringBuilder();
        int count = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            int newCount;
            if (c <= 0x7f) newCount = count + 1;
            else if (c <= 0x7ff) newCount = count + 2;
            else newCount = count + 3;
            if (newCount > byteCount) {
                break;
            }
            count = newCount;
            sb.append(c);
        }
        return sb.toString();
    }
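It does not work as expected for characters outside the BMP: each one is stored as a surrogate pair of two chars, so it is counted as 6 bytes instead of 4, and the loop can cut between the two surrogates. It may also break grapheme clusters. But for most simple tasks it should be OK.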
    truncateUtf8("e", 1) => "e"
    truncateUtf8("ée", 1) => ""
    truncateUtf8("ée", 2) => "é"
    truncateUtf8("ée", 3) => "ée"
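The limitation with characters outside the BMP can be avoided by iterating over code points instead of chars. This is a sketch of one possible variant, not the original author's code; the class name `Utf8CodePointTruncate` is hypothetical. It still does not protect grapheme clusters (for example a base letter followed by a combining accent).

```java
class Utf8CodePointTruncate {
    // Variant of truncateUtf8 that counts a surrogate pair as 4 bytes
    // and never splits it.
    static String truncateUtf8(String original, int byteCount) {
        // A char never needs more than 3 bytes, so short strings always fit.
        if (original.length() * 3L <= byteCount) {
            return original;
        }
        int count = 0; // UTF-8 bytes accepted so far
        int i = 0;     // index of the next char to examine
        while (i < original.length()) {
            int cp = original.codePointAt(i);
            int size;
            if (cp <= 0x7f) size = 1;
            else if (cp <= 0x7ff) size = 2;
            else if (cp <= 0xffff) size = 3; // also covers lone surrogates (overestimate)
            else size = 4;                   // supplementary code point
            if (count + size > byteCount) {
                break;
            }
            count += size;
            i += Character.charCount(cp);    // 1 char, or 2 for a surrogate pair
        }
        return original.substring(0, i);
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is a single emoji that needs 4 bytes in UTF-8.
        System.out.println(truncateUtf8("a\uD83D\uDE00b", 5));
    }
}
```

With this version an emoji is either kept whole when 4 bytes are available or dropped entirely, so the result never ends in half of a surrogate pair.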