How can I check if a Python unicode string contains non-Western letters?
import unicodedata as ud

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).
pip install alphabet-detector
and then use it directly:
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True
Also, a few convenience methods for major languages:
ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250, which would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x code pages would include Turkish and Maltese ...
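As a rough illustration of that encoding-based test, here is a minimal sketch assuming Python 3. The function name first_fitting_codec and the default list of code pages are my own choices for the example, not something from the question. It tries each code page in turn and reports the first one that can represent the whole string.

```python
def first_fitting_codec(text, codecs=("cp1252", "cp1250")):
    """Return the first codec in `codecs` that can encode all of `text`, else None.

    Hypothetical helper for illustration; the codec list is an assumption
    and can be extended (for example with other cp125x code pages).
    """
    for codec in codecs:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            # At least one character has no mapping in this code page.
            continue
        return codec
    return None


if __name__ == "__main__":
    for sample in ("frappé", "Łódź", "russian: гага", "汉字"):
        print(repr(sample), first_fitting_codec(sample))
```

Note that this checks every character, including digits and punctuation, not just letters, so it is stricter than the isalpha-based test. A string that fits none of the listed code pages comes back as None, which you could treat as "needs manual handling".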
You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).
I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?