What is the best way to remove accents (normalize) in a Python unicode string?


Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ASCII text. It is a third-party package, so it has to be installed first (pip install Unidecode).

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'


How about this:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
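To make the comparison concrete, here is a small Python 3 sketch that applies both filters to the same decomposed string. The variable names are just for illustration, and the two tests are only shown to agree on this sample; they are not guaranteed to agree for every character.

```python
import unicodedata

# Decompose each character into its base letter plus combining marks.
decomposed = unicodedata.normalize('NFD', 'A \u00c0 \u0394 \u038e')

# Filter 1: drop characters whose general category is Nonspacing_Mark.
by_category = ''.join(c for c in decomposed
                      if unicodedata.category(c) != 'Mn')

# Filter 2: drop characters with a nonzero canonical combining class.
by_combining = ''.join(c for c in decomposed
                       if not unicodedata.combining(c))

print(by_category == by_combining)  # True for this sample
print(by_category)
```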

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".


I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
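Here is a rough Python 3 sketch of that failure, in case it helps. The name remove_accents_ascii is made up for this example, and the extra decode step is only there because encode() returns bytes on Python 3.

```python
import unicodedata

def remove_accents_ascii(input_str):
    # Hypothetical name: the "drop everything non-ASCII" approach.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Characters that cannot be encoded as ASCII are silently discarded.
    return nfkd_form.encode('ASCII', 'ignore').decode('ASCII')

french = remove_accents_ascii('caf\u00e9')     # 'cafe'
greek = remove_accents_ascii('\u0394\u038e')   # '': the letters themselves are lost
print(repr(french), repr(greek))
```

The French word keeps its base letters because they are ASCII, while the Greek base letters are thrown away along with the accent.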

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns the canonical combining class of the character c as an integer. It is nonzero (and therefore truthy) if c combines with the preceding character, which mainly means it's a diacritic.
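A quick way to see what unicodedata.combining gives back, sketched for Python 3 (where every string is already unicode, so the u prefix is not needed):

```python
import unicodedata

# COMBINING ACUTE ACCENT is a diacritic; a plain letter is not.
acute_class = unicodedata.combining('\u0301')
letter_class = unicodedata.combining('e')
print(acute_class, letter_class)  # 230 0

def remove_accents(input_str):
    # Decompose, then keep only characters with combining class 0.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

print(remove_accents('caf\u00e9'))      # cafe
print(remove_accents('\u0394\u038e'))   # Greek base letters survive
```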

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"caf\xc3\xa9"  # the UTF-8 bytes of "café" (bytes literals must be ASCII in python 3); or simply "café" before python 3
unicode_string = byte_string.decode(encoding)
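If you are not sure which kind of string you will get, one option is a small wrapper that decodes first. This is only a sketch for Python 3: remove_accents_any and its encoding parameter are names invented here, and the default of UTF-8 is an assumption you should change to match your data.

```python
import unicodedata

def remove_accents_any(value, encoding='utf-8'):
    # Hypothetical helper: accept bytes or str, always return str.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    nfkd_form = unicodedata.normalize('NFKD', value)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

from_utf8 = remove_accents_any(b'caf\xc3\xa9')               # UTF-8 bytes
from_latin = remove_accents_any(b'caf\xe9', 'iso-8859-15')   # Latin-9 bytes
from_text = remove_accents_any('caf\u00e9')                  # already text
print(from_utf8, from_latin, from_text)
```

Decoding with the wrong encoding will either raise UnicodeDecodeError or produce the wrong characters, so the caller still has to know where the bytes came from.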