Java Clipboard: Paste HTML from Firefox on Linux
I believe the problem is that the clipboard content is read as US-ASCII and then converted to Unicode, with the expectation that German umlauts stay intact. US-ASCII is a 7-bit charset that does not include German umlauts, so they are already lost once the clipboard has been read as US-ASCII.
    public class CharsetDemo {

        public static void main(String[] args) throws Exception {
            byte[] bytes;

            // convert the German umlaut to bytes in US-ASCII charset
            bytes = "ö".getBytes("US-ASCII");
            System.out.println("US-ASCII");
            System.out.println("bytes : " + asHexString(bytes));
            System.out.println("string: " + new String(bytes, "US-ASCII"));
            System.out.println();

            // create a unicode string by decoding the US-ASCII bytes as UTF-8
            String utf8String = new String(bytes, "UTF-8");
            bytes = utf8String.getBytes("UTF-8");
            System.out.println("UTF-8");
            System.out.println("bytes : " + asHexString(bytes));
            System.out.println("string: " + utf8String);
            System.out.println();

            // convert the German umlaut to bytes in ISO-8859-1 charset
            bytes = "ö".getBytes("ISO-8859-1");
            System.out.println("ISO 8859-1");
            System.out.println("bytes : " + asHexString(bytes));
            System.out.println("string: " + new String(bytes, "ISO-8859-1"));
            System.out.println();

            // create a unicode string by decoding the ISO-8859-1 bytes as UTF-8
            utf8String = new String(bytes, "UTF-8");
            bytes = utf8String.getBytes("UTF-8");
            System.out.println("UTF-8");
            System.out.println("bytes : " + asHexString(bytes));
            System.out.println("string: " + utf8String);
            System.out.println();

            // bytes of the "REPLACEMENT CHARACTER" (U+FFFD) in UTF-8
            System.out.println("replacement character bytes: "
                    + asHexString("\uFFFD".getBytes("UTF-8")));
        }

        // formats each byte as an upper-case hex value followed by a space
        static String asHexString(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%X ", b));
            }
            return sb.toString();
        }
    }
Output:

    US-ASCII
    bytes : 3F
    string: ?    <--- "ö" cannot be encoded in US-ASCII, so it was replaced by "?" (3F)

    UTF-8
    bytes : 3F
    string: ?

    ISO 8859-1
    bytes : F6
    string: ö

    UTF-8
    bytes : EF BF BD    <--- the "REPLACEMENT CHARACTER", as the lone byte F6 is not a valid UTF-8 sequence
    string: �

    replacement character bytes: EF BF BD
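
To catch this kind of loss instead of silently ending up with "?" or the replacement character, one option is to decode strictly. The following is only a sketch, not taken from the original code: the class name StrictDecodeDemo and the helper decodeStrict are made up for illustration, and it assumes you already have the raw clipboard bytes at hand. It uses the standard java.nio.charset API to report malformed input as an exception.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {

    // Hypothetical helper: decodes the bytes with the given charset and throws
    // a CharacterCodingException instead of substituting U+FFFD on bad input.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        String umlaut = "\u00F6"; // "ö", written as an escape to be independent of the source file encoding

        // US-ASCII has no mapping for "ö" at all
        System.out.println("US-ASCII can encode: "
                + StandardCharsets.US_ASCII.newEncoder().canEncode(umlaut));

        // encoding and decoding with the same charset keeps the umlaut intact
        byte[] utf8Bytes = umlaut.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip ok: "
                + decodeStrict(utf8Bytes, StandardCharsets.UTF_8).equals(umlaut));

        // decoding ISO-8859-1 bytes (F6) as UTF-8 is reported as an error
        byte[] latin1Bytes = umlaut.getBytes(StandardCharsets.ISO_8859_1);
        try {
            decodeStrict(latin1Bytes, StandardCharsets.UTF_8);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("F6 is not valid UTF-8: " + e);
        }
    }
}
```

The point is the same as in the demo above: the bytes have to be decoded with the charset they were actually encoded in. A strict decoder just makes a mismatch visible early.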