Unix special case sensitive UTF-8 sort

Tags: shell, sorting, unicode, utf-8, collation

You don't want things of mixed case in the first column mixed together depending on what the second column has, but that is exactly what a case-insensitive sort gives you. It considers things that share a casefold to be identical.

The sort of this set of Unicode records:

    abc a
    Abc d
    Abc b
    abc e
    abæ g

is of course this:

    abæ g
    abc a
    Abc b
    Abc d
    abc e

That's because the first and second letters are each “the same” (i.e., their casefolds are identical) in all five lines, so the first different letter is the third, which being an æ of course comes before c, which is what the other four records have as their third letter.

The remaining lines all have the same first three letters, so it is their fourth letter that is dispositive, giving the sequence a, b, d, e. Spaces do not (normally) matter in a Unicode sort, because it is an alphanumeric sort, not a code point sort. Only letters are considered unless the strings are identical all the way down to case, and only then are other code points considered.

That’s just how sorting Unicode works.
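The "letters first, then case, then everything else" rule can be checked directly. The following is only a sketch: `first_difference` is a hypothetical helper name, not part of Unicode::Collate, and it assumes the module's documented defaults (four comparison levels, with spaces and punctuation ignored until the last one).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::Collate;

# One collator per comparison depth, built lazily and reused.
my %collator_for;

# Hypothetical helper: returns the first UCA level (1..4) at which two
# strings differ, or 0 if they compare equal through level 4.
sub first_difference {
    my ($x, $y) = @_;
    for my $level (1 .. 4) {
        my $collator = $collator_for{$level}
            ||= Unicode::Collate->new(level => $level);
        return $level if $collator->ne($x, $y);
    }
    return 0;
}

my @pairs = (
    [ "abc",    "abd"    ],   # different letters
    [ "resume", "résumé" ],   # same letters, different accents
    [ "abc",    "Abc"    ],   # same letters, different case
    [ "abc a",  "abca"   ],   # same letters, one extra space
);

for my $pair (@pairs) {
    printf "%-7s vs %-7s first differ at level %d\n",
        @$pair, first_difference(@$pair);
}
```

Under those defaults the four pairs should report levels 1, 2, 3 and 4 respectively, which is why "Abc b" lands before "abc e" in the example above: the b-versus-e difference is found at level 1, long before case is ever consulted.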

The Unicode Collation Algorithm does not pay attention to Danish ordering unless you ask it to. The default DUCET entries put æ and å next to a, and ø next to o. The OED sorts these entries in this order:

 allergist allergy Allerød allers allethrin

That's because the o in "Allerød" follows the g in "allergy" and precedes the s in "allers". Diacritics only matter if everything else is the same, so a hypothetical "alleroc" would precede "Allerød" and a hypothetical "allerog" would follow it but precede "allers".
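As a quick check of that ordering, here is a small sketch that sorts the OED headwords together with the two hypothetical words using the untailored collator. The helper name `uca_sort` is made up for illustration; only `new`, `sort`, and `eq` come from Unicode::Collate.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::Collate;

# Hypothetical helper: sort a list with the default (untailored) UCA ordering.
sub uca_sort {
    return Unicode::Collate->new->sort(@_);
}

# The OED headwords from above, shuffled, plus the two hypothetical words.
my @words = qw(allers alleroc allethrin Allerød allergy allerog allergist);
print "$_\n" for uca_sort(@words);

# At primary strength neither the stroke nor the capital letter is visible.
my $primary = Unicode::Collate->new(level => 1);
print "Allerød and allerod are primary-equal\n"
    if $primary->eq("Allerød", "allerod");
```

With the default table this should print allergist, allergy, alleroc, Allerød, allerog, allers, allethrin, in that order.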

That's just how sorting works in Unicode. Scandinavians hate it because they think it should just do whatever their idiosyncratic national systems do, but Unicode is not biased toward a particular language. If you want your idiosyncrasies, you have to use locale sorting. To get a Danish locale-specific sort like this:

    abc a
    Abc b
    Abc d
    abc e
    abæ g

You need to run your sort with a Danish locale specified, not in the broken POSIX way, but in the Unicode way.

First, you must give up on trying to use sort(1). It’s worse than useless: it’s unreliable and deceptive. If you have Unicode data, you should be using a Unicode sort, whether unmodified as the OED does or modified for your little village.

To produce the normal Unicode ordering, you must use:

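    #!/usr/bin/env perl
    use strict;
    use warnings;
    use open qw(:std :utf8);   # UTF-8 on STDIN, STDOUT and STDERR
    use utf8;                  # this source file contains UTF-8 literals
    use Unicode::Collate;

    # Pull each non-blank record out of the here-document, minus the indent.
    my @lines = <<'End_of_Lines' =~ /\S.*\S\n/g;
        abc a
        Abc d
        Abc b
        abc e
        abæ g
    End_of_Lines

    # Default (untailored) UCA ordering.
    my $collator = Unicode::Collate->new();
    print $collator->sort(@lines);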

While to get the locale-restricted non-default just-for-you sort, you need:

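    #!/usr/bin/env perl
    use strict;
    use warnings;
    use open qw(:std :utf8);   # UTF-8 on STDIN, STDOUT and STDERR
    use utf8;                  # this source file contains UTF-8 literals
    use Unicode::Collate::Locale;

    # Pull each non-blank record out of the here-document, minus the indent.
    my @lines = <<'End_of_Lines' =~ /\S.*\S\n/g;
        abc a
        Abc d
        Abc b
        abc e
        abæ g
    End_of_Lines

    # UCA ordering with the Danish ("da") tailoring applied.
    my $collator = Unicode::Collate::Locale->new(locale => "da");
    print $collator->sort(@lines);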

The Unicode::Collate module has been included as standard since Perl release v5.8. The Unicode::Collate::Locale module has been included as standard since Perl release v5.14, but it is trivially installable from CPAN on earlier releases:

 $ sudo perl -MCPAN -e "install Unicode::Collate::Locale"

The reason you must use Perl for this is that you simply cannot trust vendor locales to work according to the Unicode Collation Algorithm, with or without locale modifications. I have never seen two different systems where they work the same way, which means that at least one of every pair is broken and perhaps both are. In contrast, you can guarantee that the UCA will always behave the same way no matter where you are. It doesn’t care what your Terminal can display. It doesn’t care about fonts. It doesn’t care if you’re redirected. It doesn’t care about which shell you’re running. It doesn’t care whether your Aunt Gertrude happens to run the code on the 5th Monday in a month. It just works, and it works the same way every time in every situation. Use the UCA. Accept no substitutes.

But just because you use the UCA doesn’t mean you need to accept the default ordering. The UCA was designed to be super-amenable to tailoring. If you want a locale sort, this is easy — and if there’s CLDR data for that locale, it is positively trivial. If you want to do a sort of book and movie titles, or of people’s names with the surname counting stronger than the forename and with all the Scottish Mc- and Mac- names sorting before M- but irrespective of each other, all these things are very very easy with the UCA. Anything you can imagine can be done, and usually with astonishing ease. The point is that with the UCA, you always start with a behavior that is guaranteed to work exactly the same way irrespective of platform or prejudice. That means you can rely on how it works when you want to apply your own customizations to it. Without that guarantee, all is lost.
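To give a flavour of how little code a tailoring can take, here is a minimal sketch of the book-and-movie-title case, using the `preprocess` hook that Unicode::Collate provides. The helper name `sort_titles`, the sample titles, and the choice to strip only a leading English "a", "an" or "the" are assumptions made for illustration; a real title sort would need more rules than this.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::Collate;

# Hypothetical helper: UCA sort that ignores one leading English article.
sub sort_titles {
    my $collator = Unicode::Collate->new(
        preprocess => sub {
            my $title = shift;
            # Only the sort key is affected; the returned strings
            # keep their articles.
            $title =~ s/^(?:an?|the)\s+//i;
            return $title;
        },
    );
    return $collator->sort(@_);
}

print "$_\n" for sort_titles(
    "The Hobbit",
    "Dune",
    "A Tale of Two Cities",
    "An American Tragedy",
);
```

This should list An American Tragedy, Dune, The Hobbit, and A Tale of Two Cities in that order. Everything after the article is still compared by the ordinary UCA rules, so the customization sits on top of the guaranteed default behavior rather than replacing it.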

You can get a pre-made command-line replacement (well, sort of) for the Unix sort(1) program which is UCA compliant here. It doesn’t do fields of course, but it does do quite a bit more.
