What is the best way to remove accents (normalize) in a Python unicode string?

python python-3.x unicode python-2.x diacritics

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'# accented_string is of type 'unicode'import unidecodeunaccented_string = unidecode.unidecode(accented_string)# unaccented_string contains 'Malaga'and is of type 'str'

python python-3.x unicode python-2.x diacritics

How about this:

import unicodedatadef strip_accents(s):   return ''.join(c for c in unicodedata.normalize('NFD', s)                  if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")u'A A \u0394 \u03a5'>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

python python-3.x unicode python-2.x diacritics

I just found this answer on the Web:

import unicodedatadef remove_accents(input_str):    nfkd_form = unicodedata.normalize('NFKD', input_str)    only_ascii = nfkd_form.encode('ASCII', 'ignore')    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.

Edit: this does the trick:

import unicodedatadef remove_accents(input_str):    nfkd_form = unicodedata.normalize('NFKD', input_str)    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) will return true if the character c can be combined with the preceding character, that is mainly if it's a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you usebyte_string = b"café"  # or simply "café" before python 3.unicode_string = byte_string.decode(encoding)

CodeHunter

What is the best way to remove accents (normalize) in a Python unicode string?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last