Multi language database encoding in search engine Multi language database encoding in search engine database database

Multi language database encoding in search engine


You have several issues to solve to make this work correctly.

First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead. Once you have chosen a character set your users should be able to read your text.

Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. Read this. http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.

Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.

You can explicitly specify character set and collation in queries. For example, you can write

    WHERE _utf8 'München' COLLATE utf8_unicode_ci = table.name;

In this expression, _utf8 'München' is a character constant, and

   constant COLLATE utf8_unicode_ci = table.name

is a query specifier which includes an explicit collation name. Read this.http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html

Third, you may want to assign a default collation to each language specific column. Default collations are baked into indexes, so they'll help accelerate searching.

Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.