What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?


Strings that Windows holds in memory are always UTF-16 little-endian. But that's not what you're asking about: you're looking at file contents. Windows itself does not specify the encoding of files; it leaves that to individual applications.

The 0xfe 0xff you see at the start of the file is a Byte Order Mark, or BOM. It not only indicates that the file is most probably Unicode, it also tells you which Unicode encoding was used.

0xfe 0xff      UTF-16 big-endian
0xff 0xfe      UTF-16 little-endian
0xef 0xbb 0xbf UTF-8
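
As a rough illustration, the BOM signatures listed above can be checked with a few lines of Python. This is only a sketch: `detect_bom` is a hypothetical helper name, and it deliberately covers just the three BOMs mentioned here (a UTF-32 little-endian BOM also begins with 0xff 0xfe and would be misreported as UTF-16 little-endian).

```python
import codecs

# BOM signatures mapped to Python codec names.
# Only the three BOMs discussed in this answer are handled.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # 0xef 0xbb 0xbf
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 0xfe 0xff
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 0xff 0xfe
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Hypothetical helper for illustration."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

In practice you would read the first few bytes of the file in binary mode, pass them to `detect_bom`, and decode the rest of the file with the reported encoding after skipping the BOM bytes.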

A file that doesn't have a BOM should be assumed to contain 8-bit characters unless you know how it was written. That still doesn't tell you whether it's UTF-8 or some other Windows character encoding; you'll just have to guess.
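
One common way to make that guess, sketched here in Python, is to try a strict UTF-8 decode first and fall back to a legacy code page if it fails. The helper name `guess_decode` and the choice of Windows-1252 (`cp1252`) as the fallback are assumptions for illustration; the right fallback depends on where the file came from, and text that happens to be valid UTF-8 will always be reported as UTF-8.

```python
def guess_decode(data: bytes, fallback: str = "cp1252"):
    """Heuristic for BOM-less files: strict UTF-8 first, then a legacy
    code page. Returns (text, encoding_used). Hypothetical helper."""
    try:
        # Random non-ASCII bytes in a legacy encoding are rarely valid UTF-8,
        # so a successful strict decode is a reasonably strong signal.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The fallback is a guess; undefined bytes become U+FFFD
        # instead of raising an error.
        return data.decode(fallback, errors="replace"), fallback
```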

Notepad is a good example of how this is done. If the file has a BOM, Notepad will read it and process the contents appropriately. Otherwise you must specify the encoding yourself with the "Encoding" dropdown list.

Edit: the reason Windows documentation isn't more specific about the encoding is that Windows was a very early adopter of Unicode, and at the time there was only one encoding, with 16 bits per code point. When 65536 code points were determined to be inadequate, surrogate pairs were invented as a way to extend the range, and UTF-16 was born. Microsoft was already using "Unicode" to refer to their encoding and never changed the name.
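
To make the surrogate-pair mechanism concrete, here is a small Python sketch of how a code point above U+FFFF is split into two 16-bit units in UTF-16. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 rule.

```python
def to_surrogate_pair(code_point: int):
    """Split a code point in U+10000..U+10FFFF into its UTF-16
    (high, low) surrogate pair. Hypothetical helper for illustration."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Stored little-endian, as Windows does in memory, that pair becomes the bytes 0x3d 0xd8 0x00 0xde.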