How can I check if a Python unicode string contains non-Western letters?

python django unicode

import unicodedata as ud

latin_letters = {}

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False

The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).

pip install alphabet-detector

and then use it directly:

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True

There are also a few convenience methods for major scripts:

ad.is_cyrillic(u"Поиск") #True
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
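For readers curious how the name-based check from the first answer might generalize to an arbitrary script, here is a rough sketch. It is not the alphabet-detector library's actual code; the function name `only_script_chars` is hypothetical, and it assumes that matching a keyword such as "GREEK" against each character's Unicode name is good enough for the use case.

```python
import unicodedata as ud

def only_script_chars(text, script):
    """Return True if every alphabetic character in `text` has
    `script` (e.g. "LATIN", "GREEK") in its Unicode character name.

    Hypothetical helper for illustration; non-alphabetic characters
    (digits, spaces, punctuation) are ignored.
    """
    return all(
        # ud.name(ch, "") returns "" instead of raising for unnamed characters
        script in ud.name(ch, "")
        for ch in text
        if ch.isalpha()
    )

print(only_script_chars(u"ελληνικά", "GREEK"))
print(only_script_chars(u"Cyrillic and кириллический", "LATIN"))
```

Because the check is a plain substring match on the character name, the result depends on passing the keyword exactly as the Unicode names spell it.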

For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might also allow cp1250, which would pick up eastern European countries such as Poland, the Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x code pages would include Turkish and Maltese ...
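As a minimal sketch of that idea, the check below tries each allowed code page in turn and accepts the string if any one of them can encode it. The function name `fits_any_codepage` and the default list of encodings are assumptions for illustration, not something from the question.

```python
def fits_any_codepage(text, encodings=("cp1252", "cp1250")):
    """Return True if `text` can be encoded in at least one of `encodings`.

    Hypothetical helper: the default code pages follow the suggestion
    above (Western European plus Latin-based eastern European).
    """
    for enc in encodings:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue  # this code page cannot represent the text; try the next
        return True
    return False

print(fits_any_codepage(u"frappé"))
print(fits_any_codepage(u"Łódź"))
print(fits_any_codepage(u"russian: гага"))
```

Note that the whole string has to fit a single code page, so text mixing characters found only in cp1252 with characters found only in cp1250 would be rejected.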

You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).

I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?
