Get unique value for strings in two languages


Simply work with Unicode strings. UTF-8 is a good choice of encoding.

You should handle every string as Unicode throughout your program; otherwise some characters will get garbled when users enter special characters.
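To illustrate why this matters in PHP: the plain string functions count bytes, not characters, so UTF-8 text needs the mbstring functions. The following is only a minimal sketch, assuming the mbstring extension is enabled, PHP 7.4 or later (for mb_str_split), and a source file saved as UTF-8. The variable names are made up for the example.

```php
<?php
// Three Arabic letters: baa, alif, baa.
$name = 'باب';

// strlen() counts bytes; each of these letters takes two bytes in UTF-8.
$byteCount = strlen($name);

// mb_strlen() counts characters when told the encoding.
$charCount = mb_strlen($name, 'UTF-8');

// mb_str_split() cuts between characters, whereas str_split() cuts between bytes.
$chars = mb_str_split($name, 1, 'UTF-8');

echo $byteCount, ' bytes, ', $charCount, " characters\n";
```

Anything that walks a string character by character, such as the transformation described further down, should therefore use the multibyte variant.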

If what you want is to compare strings while treating some characters as equivalent (for example, with English and Greek, A would be equivalent to alpha), then you need to build a list of equivalences and transform each string into a sequence of numbers, where each number identifies the equivalence class of the corresponding character in the original string.

The fastest method would be to build a dictionary (key/value pairs) like this, in PHP:

$equiv = array('a'=>1, 'i'=>1, 'u'=>1, 'alif'=>1, 'b'=>2, 'baa'=>2, ...);

where you would replace 'alif' and 'baa' by the actual Arabic characters in Unicode.

Then, transform the strings:

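// mb_str_split() (PHP 7.4+) splits by character; str_split() would split UTF-8 text by byte.
$transformed = array_map(function ($c) use ($equiv) { return $equiv[$c]; }, mb_str_split($str));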

And then compare two transformed strings.
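Put together, the steps above might look like the sketch below. The table contents, the class numbers and the function name to_classes are illustrative assumptions, not a real equivalence list; it also assumes PHP 7.4 or later with mbstring.

```php
<?php
// Hypothetical equivalence table: groupings and class numbers are illustrative only.
$equiv = [
    'a' => 1, 'i' => 1, 'u' => 1,
    // alif
    'ا' => 1,
    'b' => 2,
    // baa
    'ب' => 2,
];

// Turn a UTF-8 string into the list of class numbers of its characters.
// Characters missing from the table become null; a real table would have to cover them.
function to_classes(string $str, array $equiv): array
{
    return array_map(
        function ($c) use ($equiv) { return $equiv[$c] ?? null; },
        mb_str_split($str, 1, 'UTF-8')
    );
}

// Latin a, b, u and Arabic alif, baa, alif.
$latin  = to_classes('abu', $equiv);
$arabic = to_classes('ابا', $equiv);

// Both strings map to the same sequence of classes, so they compare equal.
var_dump($latin === $arabic);
```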

This is called collating, and can also be used for case-insensitive comparisons of strings (make 'ab' equivalent to 'AB').

Instead of using numbers to identify the character classes, you can use one character as the representative of each class. Then you would do:

$equiv = array('a'=>'a', 'A'=>'a', 'i'=>'a', 'I'=>'a', 'u'=>'a', 'U'=>'a', 'alif'=>'a', 'b'=>'b', 'B'=>'b', 'baa'=>'b', ...);

function fold_char($c) {
    global $equiv; // the table is defined outside the function
    // Characters without an entry are left unchanged.
    return array_key_exists($c, $equiv) ? $equiv[$c] : $c;
}

$transformed = implode('', array_map('fold_char', mb_str_split($str)));

This would transform the string with the characters 'a' 'B' 'U' into 'aba', and the string with the characters 'alif', 'baa', 'alif' into 'aba', so they would be considered equivalent.

You can then store the converted string in your database along with the user name, to quickly check whether a given username already exists.
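As a rough sketch of that check, the example below folds each name to a key and refuses a name whose key is already taken. A plain PHP array stands in for the database table here; in a real application the folded key would go into a column with a unique index. The names fold_string, register_name and $takenKeys are hypothetical, and the table is a toy one.

```php
<?php
// Hypothetical folding table: one representative character per class.
$equiv = [
    'a' => 'a', 'i' => 'a', 'u' => 'a', 'ا' => 'a',
    'b' => 'b', 'ب' => 'b',
];

// Fold a UTF-8 string to its key; lowercasing first makes 'A' and 'a' agree.
function fold_string(string $str, array $equiv): string
{
    $chars = mb_str_split(mb_strtolower($str, 'UTF-8'), 1, 'UTF-8');
    return implode('', array_map(
        function ($c) use ($equiv) { return $equiv[$c] ?? $c; },
        $chars
    ));
}

// Stand-in for a table keyed by the folded user name.
$takenKeys = [];

// Returns true if the name was accepted, false if an equivalent name exists.
function register_name(string $name, array $equiv, array &$takenKeys): bool
{
    $key = fold_string($name, $equiv);
    if (isset($takenKeys[$key])) {
        return false;
    }
    $takenKeys[$key] = $name;
    return true;
}

$first  = register_name('aBU', $equiv, $takenKeys); // accepted
$second = register_name('ابا', $equiv, $takenKeys); // rejected: folds to the same key
```

Storing the original name next to the key keeps the display form intact while the key alone decides uniqueness.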

I know some database engines allow you to define your own collating sequences (basically the equiv array above), but that is a matter for another question.


I think you're going to need to find a different approach, since there's no way to uniquely transliterate arbitrary strings between two alphabets, especially between the Latin alphabet, which writes vowels as letters, and the Arabic alphabet, which marks short vowels with diacritics.

There are several ways to render practically any Latin string in Arabic. The English V, for instance, is often transliterated as ف or ٻ. The Arabic خ and ذ, among others, can also be written in English in several ways. And this is just me struggling to remember the Arabic I learned in high school.

In short, you'll have to build a heuristic database that can guess, for a given Arabic or English string, all the possible renderings of that string in the other alphabet - and STILL you'll be constantly surprised at the variations that your users will come up with.
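To give a feel for how quickly the candidates multiply, here is a small sketch that expands an Arabic string into every Latin spelling a lookup table allows. The table is a deliberately tiny, hypothetical sample (a few common romanizations of خ and ذ), and candidate_spellings is a made-up name. It assumes PHP 7.4 or later with mbstring.

```php
<?php
// Hypothetical sample table: each Arabic letter maps to several Latin spellings.
$romanizations = [
    'خ' => ['kh', 'x'],
    'ذ' => ['dh', 'th', 'z'],
    'ا' => ['a'],
];

// Return every spelling obtainable by picking one option per character.
// Characters missing from the table are passed through unchanged.
function candidate_spellings(string $str, array $table): array
{
    $results = [''];
    foreach (mb_str_split($str, 1, 'UTF-8') as $c) {
        $options = $table[$c] ?? [$c];
        $next = [];
        foreach ($results as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $results = $next;
    }
    return $results;
}

// Three letters (kha, dhal, alif) with 2, 3 and 1 options each.
$candidates = candidate_spellings('خذا', $romanizations);
echo count($candidates), " candidates\n";
```

The number of candidates is the product of the option counts per letter, so even short names produce long lists, and that is before accounting for unwritten short vowels.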