
How do locales work in Linux / POSIX and what transformations are applied?


I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: My minimal example that demonstrates the different behaviour of uniq depending on the current locale was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt$ cat test.txtɢɬ$ LC_COLLATE=C uniq -D test.txt$ LC_COLLATE=en_US.UTF-8 uniq -D test.txtɢɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again with valgrind and investigated both call graphs with kcachegrind.

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example on strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";

    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::cout << std::endl;

    s1 = "\xa2";
    s2 = "\xac";

    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1

�
�
0
-1
-10
-1

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?
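One way to see what transformation the locale actually applies is to dump the collation keys produced by strxfrm(). This is a diagnostic sketch of my own (the helper dump_key() is hypothetical, and it assumes strxfrm() behaves as the C standard describes, i.e. strcmp() on the transformed strings orders them the same way strcoll() orders the originals). If both keys come out identical, a return value of 0 from strcoll() would follow directly:

#include <clocale>
#include <cstdio>
#include <cstring>

// Print the strxfrm() collation key of s, byte by byte, in the current
// LC_COLLATE locale.
static void dump_key(const char* s)
{
    char key[64];
    std::size_t n = std::strxfrm(key, s, sizeof key);
    std::printf("key (%zu bytes):", n);
    for (std::size_t i = 0; i < n && i < sizeof key; ++i)
        std::printf(" %02x", static_cast<unsigned char>(key[i]));
    std::printf("\n");
}

int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    dump_key("\xc9\xa2");  // ɢ
    dump_key("\xc9\xac");  // ɬ
}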


It could be due to Unicode normalization. There are sequences of code points in Unicode which are distinct and yet are considered equivalent.

One simple example of that is combining characters. Many accented characters like "é" can be represented either as a single code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE) or as a combination of an unaccented base character and a combining character, e.g. the two-character sequence <U+0065, U+0301> (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).

Those two byte sequences are obviously different, and so in the C locale, they compare as different. But in a UTF-8 locale, they're treated as identical due to Unicode normalization.

Here's a simple two-line file with this example:

$ echo -e '\xc3\xa9\ne\xcc\x81' > test.txt
$ cat test.txt
é
é
$ hexdump -C test.txt
00000000  c3 a9 0a 65 cc 81 0a                              |...e...|
00000007
$ LC_ALL=C uniq -d test.txt  # No output
$ LC_ALL=en_US.UTF-8 uniq -d test.txt
é
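To tie this back to the strcoll() code in the question, here is a minimal sketch of my own (assuming the en_US.UTF-8 locale is installed). On a system where the uniq run above reports a duplicate, strcoll() should return 0 for this pair in the UTF-8 locale and a non-zero value in the C locale:

#include <clocale>
#include <cstdio>
#include <cstring>

int main()
{
    const char* nfc = "\xc3\xa9";    // U+00E9, precomposed "é"
    const char* nfd = "e\xcc\x81";   // U+0065 U+0301, decomposed "é"

    std::setlocale(LC_COLLATE, "C");
    std::printf("C:            strcoll = %d\n", std::strcoll(nfc, nfd));

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::printf("en_US.UTF-8:  strcoll = %d\n", std::strcoll(nfc, nfd));
}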

Edit by n.m.: Not all Linux systems do Unicode normalization.


Purely conjecture at this point, since we can't see the actual data, but I would guess something like this is going on.

UTF-8 encodes code points 0-127 as their representative byte value. Values above that take two or more bytes. There is a canonical definition of which ranges of code points use a certain number of bytes, and of the format of those bytes. However, a code point could in principle be encoded in more than one way. For example, code point 32 (the ASCII space) has the canonical encoding 0x20, but it could also be written as the two-byte sequence 0xc0 0xa0. That violates a strict interpretation of the encoding, so a well-formed UTF-8 writer would never encode it that way. Decoders, however, are generally written to be more forgiving, to deal with faulty encodings. The UTF-8 decoder in your particular situation might therefore be seeing a sequence that isn't a strictly conforming encoded code point and interpreting it in the most reasonable way that it can, which would cause it to see certain multi-byte sequences as equivalent to others. Locale collating sequences would then have a further effect on top of that.
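To make those canonical encoding rules concrete, here is a small illustrative encoder (my own sketch, not code from any particular library). Each code-point range maps to exactly one permitted byte length, which is why the two-byte form of the space mentioned above counts as an overlong, invalid encoding:

#include <cstdint>
#include <cstdio>
#include <initializer_list>
#include <string>

// Canonical UTF-8 encoding: each code-point range maps to exactly one byte
// length, so e.g. U+0020 must be the single byte 0x20 and never 0xC0 0xA0.
std::string utf8_encode(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                        // U+0000..U+007F   -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // U+0080..U+07FF   -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // U+0800..U+FFFF   -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // U+10000..U+10FFFF -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main()
{
    // space, é, and the two characters from the question (ɢ, ɬ)
    for (std::uint32_t cp : {0x20u, 0xE9u, 0x262u, 0x26Cu}) {
        std::printf("U+%04X ->", static_cast<unsigned>(cp));
        for (unsigned char b : utf8_encode(cp))
            std::printf(" %02x", b);
        std::printf("\n");
    }
}

Running it prints the canonical bytes for the space (20), é (c3 a9), and the two characters from the question (c9 a2 and c9 ac).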

In the C locale, 0x20 would certainly sort before 0xc0, but in a UTF-8 locale, if the decoder grabs the following 0xa0 as well, the two-byte sequence would be treated as equal to the single 0x20 byte, and the two would sort together.
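That conjecture is easy to test directly. This is again a sketch of my own, and the result depends entirely on how the system's collation handles malformed UTF-8, so treat the output as system-specific:

#include <clocale>
#include <cstdio>
#include <cstring>

int main()
{
    const char* space    = "\x20";       // canonical encoding of U+0020
    const char* overlong = "\xc0\xa0";   // overlong (invalid) two-byte form

    std::setlocale(LC_COLLATE, "C");
    std::printf("C:            strcoll = %d\n", std::strcoll(space, overlong));

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::printf("en_US.UTF-8:  strcoll = %d\n", std::strcoll(space, overlong));
}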