Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment?



Partly because file systems and the system-call interface expect NUL ('\0') bytes to terminate file names, so UTF-16, whose code units routinely contain zero bytes, would not work well. You'd have to modify a lot of code to make that change.


As jonathan-leffler mentions, the prime issue is the ASCII null character. C traditionally expects a string to be NUL-terminated, so standard C string functions will choke on any UTF-16 string containing a byte equal to an ASCII NUL (0x00) — and every character in the Basic Latin range encodes to such a byte in UTF-16. While you can certainly program with wide-character support, UTF-16 is not a suitable external encoding of Unicode in filenames, text files, or environment variables.

Furthermore, UTF-16 and UTF-32 come in both big-endian and little-endian byte orders. To deal with this, you'll either need external metadata such as a MIME type, or a Byte Order Mark (BOM). As one description of the BOM puts it:

Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.

UTF-16's predecessor, UCS-2, had the same issues, and on top of that lacked surrogate pairs, so it could only represent code points in the Basic Multilingual Plane. UCS-2 should be avoided.


I believe it's mainly the backwards compatibility that UTF-8 gives with ASCII.

For an answer to the 'dangers' question, you need to specify what you mean by 'interacting'. Do you mean interacting with the shell, with libc, or with the kernel proper?