Python not sorting unicode properly. Strcoll doesn't help

python unicode locale

Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh, the joy of using MacPorts).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another advantage, as @bobince points out: it's thread-safe, so it stays usable when the locale has to be set per request.
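The snippet above is Python 2 (print statement, cmp= argument). On Python 3, sorted() no longer accepts cmp=, so a comparison function has to be wrapped with functools.cmp_to_key. Here is a minimal sketch of that wrapping. To keep it runnable without ICU installed, polish_compare is a hypothetical hand-written comparator standing in for collator.compare; it is not part of PyICU and only knows the Polish alphabet.

```python
from functools import cmp_to_key

# Hypothetical stand-in for collator.compare: ranks letters by the Polish alphabet.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def polish_compare(left, right):
    """Return a negative, zero, or positive number, like a cmp-style function."""
    # Letters outside the alphabet all get the same rank and sort last.
    left_key = [RANK.get(ch, len(RANK)) for ch in left.lower()]
    right_key = [RANK.get(ch, len(RANK)) for ch in right.lower()]
    return (left_key > right_key) - (left_key < right_key)


words = ['a', 'z', 'ą']
# cmp_to_key turns the two-argument comparator into a key function.
result = sorted(words, key=cmp_to_key(polish_compare))
print(result)  # ['a', 'ą', 'z']
```

With PyICU installed, collator.compare should slot in where polish_compare is used here. PyICU collators also have a getSortKey method that is commonly passed straight to key=, which avoids the wrapper entirely.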


Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that the problem is present for the UTF-8 versions of de_DE, fr_FR and pl_PL, but the sort order is correct for the ISO-8859-1 versions of fr_FR and de_DE. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.

The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

The order for Fremch ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
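For anyone who wants to run a similar check from Python rather than C, here is a rough sketch. It is not the snippet referred to above. The function name report_order is made up for illustration, and the locale names in the loop are assumptions that may be spelled differently, or be missing, on your system.

```python
import locale
import unicodedata
from functools import cmp_to_key


def report_order(chars, locale_name):
    """Return the Unicode names of chars in the collation order of locale_name.

    Returns None when the locale cannot be set on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # locale.strcoll compares two strings using the platform's collation rules.
        ordered = sorted(chars, key=cmp_to_key(locale.strcoll))
        return [unicodedata.name(ch) for ch in ordered]
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


for name in ('pl_PL.UTF-8', 'de_DE.UTF-8', 'fr_FR.UTF-8'):
    print(name, report_order(['a', 'z', 'ą', 'ä', 'é'], name))
```

If the accented letters come out after LATIN SMALL LETTER Z for a locale where they should not, the platform's collation data for that locale is the likely culprit rather than Python.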


Here is how I managed to sort Persian text correctly without PyICU, using Python 3.x:

First set the locale (don't forget to import locale and platform)

import locale
import platform

if platform.system() == 'Linux':
    locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
elif platform.system() == 'Windows':
    locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
else:
    pass  # handle any other OS here

Then sort using key:

a = ['ا','ب','پ','ت','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ک','گ','ل','م','ن','و','ه','ي']
print(sorted(a, key=locale.strxfrm))

For a list of objects (dicts here):

a = [{'id':"ا"},{'id':"ب"},{'id':"پ"},{'id':"ت"},{'id':"ث"},{'id':"ج"},{'id':"چ"},{'id':"ح"},{'id':"خ"},{'id':"د"},{'id':"ذ"},{'id':"ر"},{'id':"ز"},{'id':"ژ"},{'id':"س"},{'id':"ش"},{'id':"ص"},{'id':"ض"},{'id':"ط"},{'id':"ظ"},{'id':"ع"},{'id':"غ"},{'id':"ف"},{'id':"ق"},{'id':"ک"},{'id':"گ"},{'id':"ل"},{'id':"م"},{'id':"ن"},{'id':"و"},{'id':"ه"},{'id':"ي"}]
print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))

Finally, you can reset the locale to the user's default:

locale.setlocale(locale.LC_ALL, '')
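Note that setlocale(locale.LC_ALL, '') picks the user's default from the environment, which is not necessarily whatever was active before. If you want the previous setting back, one option is to save it and restore it around the sort. A minimal sketch, assuming only LC_COLLATE needs to change; collation_locale and sort_in_locale are names I made up for illustration:

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE to name, then restore the previous value."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query without changing
    locale.setlocale(locale.LC_COLLATE, name)    # raises locale.Error if unavailable
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


def sort_in_locale(items, name, key=None):
    """Sort items with locale.strxfrm under the given locale.

    key, if given, extracts the string to compare from each item.
    """
    extract = key or (lambda item: item)
    with collation_locale(name):
        return sorted(items, key=lambda item: locale.strxfrm(extract(item)))
```

Usage would look like sort_in_locale(a, 'fa_IR.UTF-8') for the plain list, or sort_in_locale(a, 'fa_IR.UTF-8', key=lambda x: x['id']) for the list of dicts. Keep in mind that setlocale changes the locale for the whole process, so this is still not safe to call from several threads at once.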
