Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

I have done this recently in Java:

    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }

This will do as you specified:

stripDiacritics("Björn")  = Bjorn

but it will fail on, for example, Białystok, because ł is not a base letter plus a combining mark: it has no canonical decomposition, so normalization leaves it intact.
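To see why the regex never gets a chance to remove anything from ł, it can help to print the code points that NFD produces. The following is only a small sketch; the class and method names (`DecompositionDemo`, `nfdCodePoints`) are made up for illustration and are not part of the answer's code.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Returns the NFD form of s as space-separated U+XXXX code points.
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // o with diaeresis splits into a base letter and a combining mark
        System.out.println(nfdCodePoints("\u00F6")); // U+006F U+0308
        // l with stroke stays a single code point
        System.out.println(nfdCodePoints("\u0142")); // U+0142
    }
}
```

The ö splits into `o` plus the combining diaeresis, which the pattern then deletes, while ł comes back unchanged and therefore needs an explicit mapping.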

If you want a full-blown string simplifier, you will need a second cleanup round for special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it should give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections.

    import java.text.Normalizer;
    import java.util.Locale;
    import java.util.regex.Pattern;

    import com.google.common.collect.ImmutableMap;

    public class StringSimplifier {

        public static final char DEFAULT_REPLACE_CHAR = '-';
        public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);

        private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()

            // Remove characters with no semantics
            .put(".", "")
            .put("\"", "")
            .put("'", "")

            // Keep relevant characters as separation
            .put(" ", DEFAULT_REPLACE)
            .put("]", DEFAULT_REPLACE)
            .put("[", DEFAULT_REPLACE)
            .put(")", DEFAULT_REPLACE)
            .put("(", DEFAULT_REPLACE)
            .put("=", DEFAULT_REPLACE)
            .put("!", DEFAULT_REPLACE)
            .put("/", DEFAULT_REPLACE)
            .put("\\", DEFAULT_REPLACE)
            .put("&", DEFAULT_REPLACE)
            .put(",", DEFAULT_REPLACE)
            .put("?", DEFAULT_REPLACE)
            .put("°", DEFAULT_REPLACE) // Remove ?? is diacritic?
            .put("|", DEFAULT_REPLACE)
            .put("<", DEFAULT_REPLACE)
            .put(">", DEFAULT_REPLACE)
            .put(";", DEFAULT_REPLACE)
            .put(":", DEFAULT_REPLACE)
            .put("_", DEFAULT_REPLACE)
            .put("#", DEFAULT_REPLACE)
            .put("~", DEFAULT_REPLACE)
            .put("+", DEFAULT_REPLACE)
            .put("*", DEFAULT_REPLACE)

            // Replace non-diacritics with their equivalent characters
            .put("\u0141", "l") // Ł (uppercase)
            .put("\u0142", "l") // ł as in Białystok
            .put("ß", "ss")
            .put("æ", "ae")
            .put("ø", "o")
            .put("©", "c")
            .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
            .put("\u00F0", "d")
            .put("\u0110", "d")
            .put("\u0111", "d")
            .put("\u0189", "d")
            .put("\u0256", "d")
            .put("\u00DE", "th") // thorn Þ
            .put("\u00FE", "th") // thorn þ
            .build();

        public static String simplifiedString(String orig) {
            String str = orig;
            if (str == null) {
                return null;
            }
            str = stripDiacritics(str);
            str = stripNonDiacritics(str);
            if (str.length() == 0) {
                // Ugly special case to work around non-existing empty strings
                // in Oracle. Store the original string as the simplified one.
                // It would return an empty string if Oracle could store it.
                return orig;
            }
            // Locale.ROOT keeps the result independent of the JVM's default locale
            return str.toLowerCase(Locale.ROOT);
        }

        private static String stripNonDiacritics(String orig) {
            StringBuilder ret = new StringBuilder();
            String lastchar = null;
            for (int i = 0; i < orig.length(); i++) {
                String source = orig.substring(i, i + 1);
                String replace = NONDIACRITICS.get(source);
                String toReplace = replace == null ? source : replace;
                // Collapse runs of separators into a single one
                if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                    toReplace = "";
                } else {
                    lastchar = toReplace;
                }
                ret.append(toReplace);
            }
            // Drop a trailing separator
            if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
                ret.deleteCharAt(ret.length() - 1);
            }
            return ret.toString();
        }

        /*
        Special regular expression character ranges relevant for simplification
        -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
        InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î etc.
        IsSk: Symbol, Modifier see http://www.fileformat.info/info/unicode/category/Sk/list.htm
        IsLm: Letter, Modifier see http://www.fileformat.info/info/unicode/category/Lm/list.htm
        */
        public static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

        private static String stripDiacritics(String str) {
            str = Normalizer.normalize(str, Normalizer.Form.NFD);
            str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
            return str;
        }
    }
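For readers who want to try the two-pass idea without pulling in google-collections, here is a much smaller JDK-only sketch. It is not the class above: `MiniSimplifier`, its four-entry `EXTRA` map and the narrower pattern are assumptions chosen for illustration, and `Map.of` assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

public class MiniSimplifier {
    // Pass 1 pattern: combining marks left behind by NFD decomposition.
    private static final Pattern MARKS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Pass 2 table: a deliberately tiny, hypothetical subset of the full map.
    private static final Map<Character, String> EXTRA = Map.of(
        '\u0141', "l",   // L with stroke, uppercase
        '\u0142', "l",   // l with stroke, lowercase
        '\u00DF', "ss",  // sharp s
        ' ', "-");       // separator

    static String simplify(String s) {
        // Pass 1: decompose, then drop the combining marks.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = MARKS.matcher(decomposed).replaceAll("");

        // Pass 2: map the characters that have no decomposition.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < stripped.length(); i++) {
            char ch = stripped.charAt(i);
            String mapped = EXTRA.get(ch);
            out.append(mapped != null ? mapped : String.valueOf(ch));
        }
        return out.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(simplify("Bia\u0142ystok"));          // bialystok
        System.out.println(simplify("Bj\u00f6rn Stra\u00dfe"));  // bjorn-strasse
    }
}
```

Unlike the full class, this sketch does not collapse repeated separators or trim a trailing one; it only shows the order of the two passes.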

java unicode diacritics transliteration

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

    Collator c = Collator.getInstance();
    c.setStrength(Collator.PRIMARY);
    Map<CollationKey, String> dictionary = new TreeMap<CollationKey, String>();
    dictionary.put(c.getCollationKey("Björn"), "Björn");
    ...
    CollationKey query = c.getCollationKey("bjorn");
    System.out.println(dictionary.get(query)); // --> "Björn"
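The byte-array route mentioned above could look roughly like the sketch below, which uses `CollationKey.toByteArray()`. The class and method names are hypothetical, and pinning the collator to `Locale.ENGLISH` with canonical decomposition is an assumption made so that the keys do not depend on the JVM's default locale.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeyBytes {
    // Builds a PRIMARY-strength key suitable for storing in a binary column or index.
    static byte[] primaryKey(String s) {
        Collator c = Collator.getInstance(Locale.ENGLISH); // assumed locale
        c.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters such as U+00F6 before comparing
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        byte[] stored = primaryKey("Bj\u00f6rn");
        byte[] query = primaryKey("bjorn");
        // Keys that differ only by accent or case should compare equal
        System.out.println(Arrays.equals(stored, query));
    }
}
```

Keys are only comparable when they come from collators configured the same way, so the locale, strength and decomposition settings would need to stay fixed for as long as the stored keys are in use.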

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
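Because of that locale dependence, it may be safer to pass the locale explicitly than to rely on the JVM default. A minimal sketch, with a hypothetical helper named `samePrimary`:

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitiveCompare {
    // True when a and b have no PRIMARY difference under the given locale's rules,
    // i.e. they differ at most by accents or case as that locale defines them.
    static boolean samePrimary(String a, String b, Locale locale) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(Collator.PRIMARY);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(samePrimary("Bj\u00f6rn", "bjorn", Locale.ENGLISH));
        System.out.println(samePrimary("Bjorn", "Bjarn", Locale.ENGLISH));
    }
}
```

The same pair of strings can compare differently under another locale's rules, so the locale is best treated as part of the matching contract rather than an implementation detail.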


It's part of Apache Commons Lang as of ver. 3.1.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An
