Java : How to determine the correct charset encoding of a stream


You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings: an encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.
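To make that concrete, here is a minimal sketch that decodes the same two bytes with two different charsets. The class and method names are made up for illustration; the point is only that both decodings succeed and neither is flagged as wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameBytesDifferentText {
    // Decodes the given bytes with the given charset. Malformed input is
    // silently replaced, so this call does not fail on "wrong" encodings.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the bytes 0xC3 0xA9.
        byte[] bytes = "\u00E9".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        // One character under UTF-8, two characters under ISO-8859-1.
        System.out.println(asUtf8.length() + " char(s) as UTF-8");
        System.out.println(asLatin1.length() + " char(s) as ISO-8859-1");
    }
}
```

Both results are valid strings; only a human (or a heuristic) can say which one was intended.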

The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
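A small sketch of that behaviour, assuming the getEncoding() in question is the one on java.io.InputStreamReader: the reader reports whatever charset it was constructed with, regardless of what the bytes actually contain. The class and helper names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetEncodingDemo {
    // Wraps the data in a reader configured with the given charset and
    // returns the charset that the reader reports for itself.
    static Charset reportedCharset(byte[] data, Charset configured) {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), configured)) {
            // getEncoding() may return a historical name such as "UTF8";
            // Charset.forName resolves that alias to the canonical charset.
            return Charset.forName(reader.getEncoding());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The data is UTF-8, but the reader just echoes its configuration.
        byte[] utf8Data = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(reportedCharset(utf8Data, StandardCharsets.UTF_8));
        System.out.println(reportedCharset(utf8Data, StandardCharsets.ISO_8859_1));
    }
}
```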

Some formats tell you which encoding was used to create them, for example XML and HTML, which can declare it inside the document. An arbitrary byte stream does not.
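For XML, the declared encoding can be read from the XML declaration at the start of the document. The following is only a rough sketch with hypothetical names: it assumes an ASCII-compatible encoding and no byte order mark, and it does not replace a real XML parser, which handles many more cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclaredEncoding {
    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Looks at the first bytes of a document and returns the declared
    // encoding name, if there is one.
    static Optional<String> declaredEncoding(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and leaves the ASCII declaration readable.
        int length = Math.min(head.length, 200);
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(doc).orElse("(not declared)"));
    }
}
```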

Still, you could try to guess the encoding on your own if you have to. Every language has a characteristic frequency for each character: in English, the letter e appears very often, while ê appears very rarely. An ISO-8859-1 stream usually contains no 0x00 bytes, whereas a UTF-16 stream of mostly Latin text contains a lot of them.
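The 0x00 observation can be turned into a crude check like the one below. This is a sketch, not a reliable detector: the names are made up, and the 0.2 threshold is an arbitrary value chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ZeroByteHeuristic {
    // Fraction of bytes in the sample that are equal to 0x00.
    static double zeroRatio(byte[] sample) {
        if (sample.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : sample) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / sample.length;
    }

    // Guesses UTF-16 when many bytes are zero. The 0.2 threshold is an
    // assumption for this example, not a tuned or standard value.
    static boolean looksLikeUtf16(byte[] sample) {
        return zeroRatio(sample) > 0.2;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(looksLikeUtf16(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

For ASCII-only text encoded as UTF-16, every second byte is zero, so the ratio is 0.5. Text in scripts outside the Latin range produces far fewer zero bytes, which is one reason this heuristic alone is not enough.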

Or you could ask the user. I have seen applications that show you a snippet of the file decoded in different encodings and ask you to select the "correct" one.


Check this out: http://site.icu-project.org/ (ICU4J). It includes a library for detecting the charset of an input stream. It could be as simple as this:

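    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    // input, reader and charset are declared elsewhere.
    // The detector needs a stream that supports mark/reset, hence the buffering.
    BufferedInputStream bis = new BufferedInputStream(input);
    CharsetDetector cd = new CharsetDetector();
    cd.setText(bis);
    CharsetMatch cm = cd.detect();  // best guess, or null if nothing matched
    if (cm != null) {
        reader = cm.getReader();
        charset = cm.getName();
    } else {
        // The constructor requires a charset name; "unknown" is a placeholder.
        throw new UnsupportedCharsetException("unknown");
    }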