Python not sorting unicode properly. Strcoll doesn't help


Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

On OS X: sudo port install py26-pyicu, minding the bug described here: https://svn.macports.org/ticket/23429 (oh, the joy of using MacPorts).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done (Python 2):

    import PyICU

    # Build a collator for the Polish locale and use its compare() as the cmp function.
    collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
    print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another advantage, as @bobince points out: it is thread-safe, so it stays usable when the locale has to be chosen per request.
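
Note that the cmp= argument used above exists only in Python 2. Here is a minimal sketch of how the same idea might look on Python 3, assuming a recent PyICU, which is imported under the name icu. Any two-argument comparison function can be wrapped with functools.cmp_to_key, and a collator's getSortKey method can be passed directly as a key function. The helper names sort_with_comparator and icu_sorted are made up for illustration.

```python
from functools import cmp_to_key


def sort_with_comparator(items, compare):
    """Sort with a Python 2 style comparison function (returns <0, 0 or >0)."""
    return sorted(items, key=cmp_to_key(compare))


def icu_sorted(items, locale_name="pl_PL"):
    """Sort strings with an ICU collator; requires PyICU to be installed."""
    import icu  # imported lazily so the rest of the module works without PyICU

    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # getSortKey returns a byte string whose ordering follows the collation rules.
    return sorted(items, key=collator.getSortKey)


if __name__ == "__main__":
    # Plain code point comparison, just to show the cmp_to_key wrapper at work.
    print(sort_with_comparator(["b", "a", "c"], lambda x, y: (x > y) - (x < y)))
```

With PyICU available, icu_sorted(['a', 'z', 'ą']) should correspond to the Python 2 call shown earlier; sort_with_comparator(words, collator.compare) would be the more literal translation.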


Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that the problem is present for the UTF-8 versions of de_DE, fr_FR and pl_PL, but that the sort order is correct for the ISO-8859-1 versions of fr_FR and de_DE. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

    The order for Polish ISO-8859 is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER Z
    LATIN SMALL LETTER A WITH OGONEK
    The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

    The order for Polish Unicode is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER Z
    LATIN SMALL LETTER A WITH OGONEK
    The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

    The order for German Unicode is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER Z
    LATIN SMALL LETTER A WITH DIAERESIS
    The LC_COLLATE culture and encoding settings were de_DE, UTF8.

    The order for German ISO-8859 is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER A WITH DIAERESIS
    LATIN SMALL LETTER Z
    The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

    The order for French ISO-8859 is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER E WITH ACUTE
    LATIN SMALL LETTER Z
    The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

    The order for French Unicode is:
    LATIN SMALL LETTER A
    LATIN SMALL LETTER Z
    LATIN SMALL LETTER E WITH ACUTE
    The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
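
A similar check can be run from Python, since locale.strcoll hands the comparison to the platform's C library. The following is only a sketch: the function name probe_collation is made up, and the locale names in the usage part are the OS X spellings from the output above, which may be spelled differently or be missing on other systems. Locales that cannot be set are reported as None.

```python
import locale
from functools import cmp_to_key


def probe_collation(words, locale_names):
    """Return {locale name: words sorted via strcoll, or None if unavailable}."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query only, changes nothing
    results = {}
    try:
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_COLLATE, name)
            except locale.Error:
                results[name] = None  # locale not installed on this system
                continue
            results[name] = sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        # Put the collation setting back the way it was found.
        locale.setlocale(locale.LC_COLLATE, saved)
    return results


if __name__ == "__main__":
    names = ["pl_PL.ISO8859-2", "pl_PL.UTF-8", "de_DE.UTF-8", "de_DE.ISO8859-1"]
    for name, order in probe_collation(["a", "z", "ą", "ä"], names).items():
        print(name, order)
```

A locale whose result places the accented letter after "z" would be showing the behaviour described in this answer.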


Here is how I managed to sort Persian text correctly without PyICU, using Python 3.x:

First set the locale (don't forget to import locale and platform)

    import locale
    import platform

    if platform.system() == 'Linux':
        locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
    elif platform.system() == 'Windows':
        locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
    else:
        pass  # set the appropriate locale name for any other OS here

Then sort using key:

    a = ['ا', 'ب', 'پ', 'ت', 'ث', 'ج', 'چ', 'ح',
         'خ', 'د', 'ذ', 'ر', 'ز', 'ژ', 'س', 'ش',
         'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق',
         'ک', 'گ', 'ل', 'م', 'ن', 'و', 'ه', 'ي']
    print(sorted(a, key=locale.strxfrm))

For a list of objects (dicts here), apply strxfrm to the field you want to sort by:

    a = [{'id': "ا"}, {'id': "ب"}, {'id': "پ"}, {'id': "ت"},
         {'id': "ث"}, {'id': "ج"}, {'id': "چ"}, {'id': "ح"},
         {'id': "خ"}, {'id': "د"}, {'id': "ذ"}, {'id': "ر"},
         {'id': "ز"}, {'id': "ژ"}, {'id': "س"}, {'id': "ش"},
         {'id': "ص"}, {'id': "ض"}, {'id': "ط"}, {'id': "ظ"},
         {'id': "ع"}, {'id': "غ"}, {'id': "ف"}, {'id': "ق"},
         {'id': "ک"}, {'id': "گ"}, {'id': "ل"}, {'id': "م"},
         {'id': "ن"}, {'id': "و"}, {'id': "ه"}, {'id': "ي"}]
    print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))

Finally, you can reset the locale to the user's default setting:

    locale.setlocale(locale.LC_ALL, '')
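
If the program was using some other locale before the sort, resetting to '' will not bring that one back. One possible way to handle this, sketched here with a made-up helper named collation, is to remember the current LC_COLLATE setting and restore it afterwards. Keep in mind that setlocale changes process-wide state, so this is not safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation(name):
    """Temporarily switch LC_COLLATE to `name`, then restore the old setting."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current value
    locale.setlocale(locale.LC_COLLATE, name)       # raises locale.Error if unknown
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    records = [{'id': 'b'}, {'id': 'a'}, {'id': 'c'}]
    # 'C' is used only because it exists everywhere; on Linux one would pass
    # something like 'fa_IR.UTF-8' instead, as in the answer above.
    with collation('C'):
        print(sorted(records, key=lambda r: locale.strxfrm(r['id'])))
```

Only LC_COLLATE is touched here rather than LC_ALL, since collation is the only category that strxfrm depends on.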