Python regex matching Unicode properties Python regex matching Unicode properties python python

Python regex matching Unicode properties


The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.


Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.


You can painstakingly use unicodedata on each character:

import unicodedatadef strip_accents(x):    return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')