How to account for accent characters for regex in Python?
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
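To see the line above in action, here is a minimal sketch for Python 3. The sample text in str1 is made up for illustration, since the original string is not shown in the question. In Python 3, \w already matches Unicode word characters when the pattern is a str, so the re.UNICODE flag is redundant there but harmless.

```python
import re

# Hypothetical sample text; the original str1 is not shown in the thread.
str1 = "Loving #yogenfrüz and #café_crème today"

# \w matches Unicode letters, digits and underscore for str patterns in Python 3.
# re.UNICODE is kept only to mirror the line above; it is the default here.
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

print(hashtags)  # ['yogenfrüz', 'café_crème']
```

This assumes the accented letters are stored as single precomposed code points. If the text is in decomposed form, the combining marks are not word characters and the match stops early.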
EDIT: Check the useful comment below from Martijn Pieters.
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
"How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?"
Assume you have loaded your Unicode string into a variable called my_unicode. Normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example:
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
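The session above is from Python 2. As a rough Python 3 equivalent, the sketch below wraps the same normalize-then-encode step in a helper. The name strip_accents is hypothetical, not something from the thread. In Python 3, encode() returns bytes, so a final decode() is added to get a str back.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper name)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks, along with
    # any other non-ASCII character; decode() turns the bytes back into str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('àà'))         # aa
print(strip_accents('yogenfrüz'))  # yogenfruz
```

Be aware that letters with no decomposition, such as ø or ß, are dropped entirely by this approach instead of being replaced by a close ASCII letter.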
Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz']
Hope this'll help anyone else.
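For anyone trying the explicit range in Python 3, here is a small sketch with made-up sample strings (the original str1 is not shown). It also illustrates two limits of the À-ÿ range: it includes the symbols × (215) and ÷ (247), and it misses accented letters beyond code point 255.

```python
import re

# Hypothetical sample text; the original str1 is not shown in the thread.
str1 = "Snack time: #yogenfrüz #crème2go"

# ASCII letters, digits, underscore, plus the Latin-1 range À (192) to ÿ (255).
pattern = re.compile(r'#([A-Za-z0-9_À-ÿ]+)')

print(pattern.findall(str1))       # ['yogenfrüz', 'crème2go']

# Caveat 1: the range also contains × (215) and ÷ (247), which are not letters.
print(pattern.findall("#a×b"))     # ['a×b']

# Caveat 2: letters above code point 255, such as ő (337), end the match.
print(pattern.findall("#erdős"))   # ['erd']
```

If either limit matters for your data, the \w version with Unicode matching is probably the safer choice.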