How can Perl and the Unix sort utility order Unicode strings in the same sequence?


Using Unicode::Sort or Unicode::Sort::Locale makes no sense. You're not trying to sort based on Unicode definitions; you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

You could expand the counted (uniq -c style) files back into their repeated lines and let the sort utility do the ordering.

    for q in file1.uniqc file2.uniqc ; do
       perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
    done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.


I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

    $ export LC_COLLATE=en_US.UTF-8

    $ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
    (
    ($1
    1

    $ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
    (
    ($1
    1

    $ sort data
    (
    1
    ($1

Truth be told, it's the sort utility that's weird.


In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.


I can't answer directly, but I had problems getting a simple script to sort Serbian Latin text correctly. I found https://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html/, copied its setup (my actual processing is much simpler than the article's), and finally got the correct alphabetic sorting for that language and locale. The whole set of guides at https://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html/ covers about as much as anyone would need to know about Unicode linguistic sorting.

I assume you want to sort Greek. Here's a very simple version of what I copied and adapted from the guide, which sorts correctly.

    # min required setup for trial sort
    use utf8;                  # the source contains literal UTF-8 text
    use v5.14;                 # for locale sorting and unicode_strings
    use open qw(:std :utf8);   # encode output as UTF-8, avoiding "Wide character" warnings
    use Unicode::Normalize;
    use Unicode::Collate::Locale;

    my @words = qw{
        Η
        Ιθάκη
        σ'
        έδωσε
        το
        ωραίο
        ταξίδι.
        Χωρίς
        αυτήν
        δεν
        θάβγαινες
        στον
        δρόμο.
    };

    print "Unsorted: @words\n";

    # Collator tailored for Greek
    my $coll = Unicode::Collate::Locale->new( locale => "el_GR" );
    my @sorted_words = $coll->sort(@words);

    print "Sorted: @sorted_words\n";