How to account for accent characters for regex in Python? How to account for accent characters for regex in Python? python python

How to account for accent characters for regex in Python?


Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo

EDITCheck the useful comment below from Martijn Pieters.


You may also want to use

import unicodedataoutput = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

how do i convert all those escape characters into their respective characters like if there is an unicode à, how do i convert that into a standard a?Assume you have loaded your unicode into a variable called my_unicode... normalizing à into a is this simple...

import unicodedataoutput = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')Explicit example...

myfoo = u'àà'myfoou'\xe0\xe0'unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')'aa'

check this answer it helped me a lot: How to convert unicode accented characters to pure ascii without accents?


I know this question is a little outdated but you may also consider adding the range of accented characters À (index 192) and ÿ (index 255) to your original regex.

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

which will return ['yogenfrüz']

Hope this'll help anyone else.