How are Linux shells and filesystem Unicode-aware? How are Linux shells and filesystem Unicode-aware? shell shell

How are Linux shells and filesystem Unicode-aware?


It is true that UTF-16 and other "wide character" encodings cannot be used for pathnames in Linux (nor any other POSIX-compliant OS).

It is not true in principle that anyone assumes UTF-8, although that may come to be true in the future as other encodings die off. What Unix-style programs assume is an ASCII-compatible encoding. Any encoding with these properties is ASCII-compatible:

  • The fundamental unit of the encoding is a byte, not a larger entity. Some characters might be encoded as a sequence of bytes, but there must be at least 127 characters that are encoded using only a single byte, namely:
  • The characters defined by ASCII (nowadays, these are best described as Unicode codepoints U+000000 through U+00007F, inclusive) are encoded as single bytes, with values equal to their Unicode codepoints.
  • Conversely, the bytes with values 0x00 through 0x7F must always decode to the characters defined by ASCII, regardless of surrounding context. (For instance, the string 0x81 0x2F must decode to two characters, whatever 0x81 decodes to and then /.)

UTF-8 is ASCII-compatible, but so are all of the ISO-8859-n pages, the EUC encodings, and many, many others.

Some programs may also require an additional property:

  • The encoding of a character, viewed as a sequence of bytes, is never a proper prefix nor a proper suffix of the encoding of any other character.

UTF-8 has this property, but (I think) EUC-JP doesn't.

It is also the case that many "Unix-style" programs reserve the codepoint U+000000 (NUL) for use as a string terminator. This is technically not a constraint on the encoding, but on the text itself. (The closely-related requirement that the byte 0x00 not appear in the middle of a string is a consequence of this plus the requirement that 0x00 maps to U+000000 regardless of surrounding context.)


There is no encoding of filenames in Linux (in ext family of filesystems at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.

POSIX stipulates that the shell obeys the locale environment vsriables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much as such encodings are not commonly supported by existing locales. UTF-8 on the other hand seems to be supported well: in my experiments bash correctly matches the ? character with a single Unicode character, rather than a single byte, in the filename (given an UTF-8 locale) as prescribed by POSIX.