Why does UTF-8 text sort in different order between OS X and Linux?


It seems that your Linux sort is not following the proper Unicode collation order.

The Unicode code points (in hex) of the first character of each line in your unsorted.txt are:

ウ - 30A6

foo - 0066

チ - 30C1

'foo' - 0027

津 - 6D25
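The values above can be reproduced without the web tool. The following is a minimal sketch, assuming Python 3; the `SAMPLE` list and the helper names `first_code_point` and `utf8_hex` are made up for illustration, and the Japanese entries contain only the first character of each line, since that is all the list above shows. It also prints the actual UTF-8 bytes, which are a different thing from the code point.

```python
# Hypothetical stand-in for unsorted.txt: only the first characters
# of the Japanese lines are known from the list above.
SAMPLE = ["ウ", "foo", "チ", "'foo'", "津"]


def first_code_point(line: str) -> str:
    """Return the code point of the first character as 4+ hex digits."""
    return f"{ord(line[0]):04X}"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a string as uppercase hex."""
    return ch.encode("utf-8").hex().upper()


if __name__ == "__main__":
    for line in SAMPLE:
        first = line[0]
        print(f"{line!r}: U+{first_code_point(line)}, UTF-8 bytes {utf8_hex(first)}")
```

Note that the numbers in the list are code points; the UTF-8 byte sequence for a non-ASCII character is longer (three bytes for each of the Japanese characters here).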

taken from http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E3%82%A6&mode=char

So the proper sort order according to the Unicode Collation Algorithm's default table (http://www.unicode.org/Public/UCA/latest/allkeys.txt) would be:

'foo' - line 487

foo - line 8966

ウ - line 20875

チ - line 21004

津 - not in file

So, to answer your question: your Linux machine is providing the wrong collation tables to the sort function. Unfortunately, I can't tell what the reason for that might be.
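To see which order a given machine actually applies, it can help to compare a plain code point sort against the C library's locale collation. This is only a sketch under assumptions: it uses Python 3's standard `locale` module, the function names and the `SAMPLE` list are hypothetical, and the locale-aware result depends entirely on the collation tables installed on the system, so no particular output is claimed for it.

```python
import locale

# Hypothetical stand-in for unsorted.txt (first characters only
# for the Japanese lines).
SAMPLE = ["ウ", "foo", "チ", "'foo'", "津"]


def sort_by_code_point(lines):
    """Sort by Unicode code point (Python's default string ordering)."""
    return sorted(lines)


def sort_by_locale(lines, locale_name=""):
    """Sort using the system collation for locale_name.

    An empty name means "use the environment" (LANG / LC_ALL).
    Raises locale.Error if the locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore


if __name__ == "__main__":
    print("code point order:", sort_by_code_point(SAMPLE))
    try:
        print("locale order:    ", sort_by_locale(SAMPLE))
    except locale.Error as exc:
        print("locale not available:", exc)
```

If the two lines of output differ, the difference comes from the locale's collation tables rather than from the encoding of the text.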

PS: There's a similar question to yours here.

EDIT

As @ninjalj noticed, glibc doesn't use the UCA, but ISO 14651 instead. This bug report suggests migrating to the UCA. Unfortunately, it's still not resolved.

Also, it could be related to the question about ls case insensitivity on Mac OS X. Some people even suggest that it has something to do with the HFS filesystem.