How can I check if a Python unicode string contains non-Western letters?


import unicodedata as ud

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
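If it helps to see which letters fail the check, here is a minimal sketch of the same idea for Python 3. It swaps the hand-rolled dict cache for `functools.lru_cache`; the helper name `non_latin_letters` is my own and not part of the answer above.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def is_latin(ch):
    # unicodedata.name() raises ValueError for unnamed code points unless a
    # default is given; with '' as the default they count as non-Latin.
    return 'LATIN' in unicodedata.name(ch, '')


def non_latin_letters(text):
    """Return the alphabetic characters whose Unicode name lacks 'LATIN'."""
    return [ch for ch in text if ch.isalpha() and not is_latin(ch)]


def only_roman_chars(text):
    # Same test as above: digits, spaces and punctuation are ignored.
    return not non_latin_letters(text)


print(non_latin_letters("ελληνικά means greek"))
print(only_roman_chars("123 ångstrom ð áß"))
```

An empty list means every letter in the string has a Latin-based Unicode name.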


The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).

pip install alphabet-detector

and then use it directly:

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN")  # False
ad.only_alphabet_chars(u"ελληνικά", "GREEK")  # True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN")  # True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN")  # True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN")  # True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN")  # False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC")  # True

There are also a few convenience methods for major scripts:

ad.is_cyrillic(u"Поиск")  # True
ad.is_latin(u"howdy")  # True
ad.is_cjk(u"hi")  # False
ad.is_cjk(u'汉字')  # True


For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250, which would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x would include Turkish and Maltese ...
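As a rough illustration of the encoding-based check, here is a minimal Python 3 sketch. The function name `encodable_in` and the default pair of code pages are my own assumptions; adjust the list to whatever your shipping department can actually handle.

```python
def encodable_in(text, encodings=("cp1252", "cp1250")):
    """Return the first encoding that can represent text, or None."""
    for enc in encodings:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue  # some character is missing from this code page
        return enc
    return None


for address in ("12 rue du Frappé", "ul. Piotrkowska, Łódź", "улица Гагарина"):
    print(address, "->", encodable_in(address))
```

Note that this tests whether the whole string fits in a code page, so it is stricter about punctuation and symbols than a letters-only check, and a string mixing characters from two different code pages will be rejected.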

You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).

I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?