How to remove all of the diacritics from a file?


The man page for iconv says:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

So we could do:

kent$  cat test1
    Replace ā, á, ǎ, and à with a.
    Replace ē, é, ě, and è with e.
    Replace ī, í, ǐ, and ì with i.
    Replace ō, ó, ǒ, and ò with o.
    Replace ū, ú, ǔ, and ù with u.
    Replace ǖ, ǘ, ǚ, and ǜ with ü.
    Replace Ā, Á, Ǎ, and À with A.
    Replace Ē, É, Ě, and È with E.
    Replace Ī, Í, Ǐ, and Ì with I.
    Replace Ō, Ó, Ǒ, and Ò with O.
    Replace Ū, Ú, Ǔ, and Ù with U.
    Replace Ǖ, Ǘ, Ǚ, and Ǜ with U.
kent$  iconv -f utf8 -t ascii//TRANSLIT test1
    Replace a, a, a, and a with a.
    Replace e, e, e, and e with e.
    Replace i, i, i, and i with i.
    Replace o, o, o, and o with o.
    Replace u, u, u, and u with u.
    Replace u, u, u, and u with u.
    Replace A, A, A, and A with A.
    Replace E, E, E, and E with E.
    Replace I, I, I, and I with I.
    Replace O, O, O, and O with O.
    Replace U, U, U, and U with U.
    Replace U, U, U, and U with U.
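
The iconv call above prints the converted text and leaves test1 itself untouched. To change the file, one option is to go through a temporary file. This is a minimal sketch, assuming an iconv that supports //TRANSLIT (for example the glibc one) and that mktemp is available. The function name strip_diacritics is made up for illustration, and the exact transliteration can differ between iconv implementations and locales.

```shell
# Hypothetical helper: transliterate a UTF-8 file to ASCII, replacing the file.
# Assumes an iconv that understands the //TRANSLIT suffix.
strip_diacritics() {
    file=$1
    status=0
    tmp=$(mktemp) || return 1
    # Overwrite the original only if iconv succeeded; writing with cat
    # (rather than mv) keeps the original file's permissions.
    iconv -f UTF-8 -t ASCII//TRANSLIT "$file" > "$tmp" &&
        cat "$tmp" > "$file" || status=$?
    rm -f "$tmp"
    return "$status"
}
```

It would be called as: strip_diacritics test1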


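This might work for you (GNU sed, in a UTF-8 locale). The y command replaces each character in the first list with the character at the same position in the second list, and -i edits the file in place: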

sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file
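
Since the y command only touches the characters listed in it, any other accented character in the file stays as it is. A quick way to see whether something was missed is to count the bytes that fall outside the ASCII range. This is a minimal sketch; the helper name count_non_ascii_bytes is made up for illustration and is not part of sed or iconv.

```shell
# Hypothetical helper: count the bytes outside the ASCII range in a file.
# LC_ALL=C makes tr work on single bytes. Every non-ASCII character in
# UTF-8 is made up only of bytes >= 0x80, so a result of 0 means pure ASCII.
count_non_ascii_bytes() {
    # Delete all ASCII bytes, count what is left, and strip the
    # padding that some wc implementations put before the number.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Note that the sed command above maps ǖ, ǘ, ǚ, ǜ to ü, so a file converted with it still reports a non-zero count wherever those characters occurred.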


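I like iconv because it handles all accent variations. According to the man page, the extra //IGNORE silently discards characters that cannot be represented in the target character set:

iconv -f utf8 -t ascii//TRANSLIT//IGNORE non-ascii.txt > ascii.txt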