matching unicode characters in python regular expressions matching unicode characters in python regular expressions python python

matching unicode characters in python regular expressions


You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict(){'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is in Python 2; in Python 3 you must leave out the u because all strings are Unicode.


You need the UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)


In Python 2, you need the re.UNICODE flag and the unicode string constructor

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)u',./___-=+'>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)u',./___-=+'>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)u',./___-=+'>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)u',./___-=+'>>> re.sub(r"[\w]+","___",unicode(",./你好,世界-=+","utf-8"),flags=re.UNICODE)u',./___\uff0c___-=+'>>> print re.sub(r"[\w]+","___",unicode(",./你好,世界-=+","utf-8"),flags=re.UNICODE),./___,___-=+

(In the latter case, the comma is Chinese comma.)