Matching Unicode characters in Python regular expressions
You need to specify the re.UNICODE flag and pass your input as a Unicode string by using the u prefix:
>>> import re
>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}
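This is in Python 2. In Python 3 you can leave out the u prefix because all strings are Unicode (the prefix was a syntax error in Python 3.0 to 3.2 and is accepted again from 3.3), and re.UNICODE is already the default for str patterns.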
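For comparison, here is a minimal Python 3 sketch of the same match. The names PATTERN, path, groups and ascii_match are only illustrative. It assumes a str pattern, for which Unicode matching is the default, and uses re.ASCII to show what happens when \w is limited to ASCII:

```python
import re

PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'
path = '/by_tag/påske/øyfjell.jpg'

# Python 3: str patterns are Unicode-aware by default, so no flag is passed.
m = re.match(PATTERN, path)
groups = m.groupdict()
print(groups)

# re.ASCII restricts \w to [a-zA-Z0-9_], so 'å' and 'ø' no longer match
# and the anchored pattern fails as a whole.
ascii_match = re.match(PATTERN, path, re.ASCII)
print(ascii_match)
```

Passing re.UNICODE explicitly in Python 3 is allowed but redundant for str patterns.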
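You need the re.UNICODE flag (in Python 2 the input must also be a unicode object; in Python 3 the flag is already the default for str patterns):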
m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)
In Python 2, you need the re.UNICODE flag and a unicode string, built here with the unicode() constructor:
>>> import re
>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+
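(In the last two cases, the comma between the Chinese words is the fullwidth Chinese comma, U+FF0C. It is punctuation, so \w does not match it and it is left in place.)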
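The same substitutions can be sketched in Python 3, where no flag or unicode() call is needed. The names samples and results are only illustrative, and the fullwidth comma is written as the escape \uff0c so it is easy to tell apart from the ASCII one:

```python
import re

samples = [
    ",./hello-=+",
    ",./cześć-=+",
    ",./привет-=+",
    ",./你好-=+",
    ",./你好\uff0c世界-=+",  # \uff0c is the fullwidth Chinese comma
]

# In Python 3, \w on a str pattern matches Unicode word characters by default.
results = [re.sub(r"\w+", "___", s) for s in samples]

for original, replaced in zip(samples, results):
    print(original, "->", replaced)
```

Each run of word characters collapses to ___ regardless of script, while the fullwidth comma splits the last sample into two runs.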