How to index mixed language contents on Elasticsearch?


What is the target you want to achieve? Do you want to have hits only in the language used at query time? Or would you also accept hits in any other language?

One approach would be to run all of Elasticsearch's different language analyzers on the input and store the results in separate fields, for instance suffixed with the language of the respective analyzer. Then, at query time, you would have to search across all of these fields if you have no way to guess the most relevant ones.

However, the index size is likely to explode, since every document is analyzed and stored once per language and most of those copies will never be useful. IMHO this is also less elegant than having separate indices.
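To make the one-field-per-analyzer idea concrete, here is a minimal sketch that only builds the request bodies as Python dicts. It assumes a text field called `body` (a hypothetical name) and uses three of the built-in language analyzers as an illustrative subset. It also uses multi-fields (`body.english`, `body.german`, ...) rather than hand-copied, suffixed top-level fields, which is one possible way to do it, not the only one.

```python
import json

# Illustrative subset of Elasticsearch's built-in language analyzers.
LANGUAGES = ["english", "german", "french"]


def build_mapping(field="body", languages=LANGUAGES):
    """Mapping with one sub-field per language analyzer.

    `field` is a hypothetical field name chosen for this example.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "standard",
                    # Each sub-field re-analyzes the same text with another analyzer.
                    "fields": {
                        lang: {"type": "text", "analyzer": lang}
                        for lang in languages
                    },
                }
            }
        }
    }


def build_query(text, field="body", languages=LANGUAGES):
    """multi_match query over the base field and every language sub-field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": [field] + [f"{field}.{lang}" for lang in languages],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("haus"), indent=2))
```

The first dict would go into the index creation request and the second into a search request. Every language you add means one more analyzed copy of each document, which is exactly the growth problem mentioned above.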

I would strongly recommend evaluating whether you really cannot know which languages you will see in production. Having a distinct index per language would give you much more control over the input/output and enable you to fine-tune your engine to the actual use case.
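A rough sketch of the index-per-language variant, again only building names and request bodies. Everything named here is an assumption for illustration: the `docs-` index prefix, the `lang` key on each document (set by whatever language detection you run upstream), and the three-language table.

```python
# Hypothetical mapping from language code to a built-in analyzer name.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german", "fr": "french"}


def index_name(lang_code, prefix="docs"):
    """Name of the index that holds documents of one language."""
    return f"{prefix}-{lang_code}"


def index_settings(lang_code):
    """Index settings that make the language analyzer the index default."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {"type": LANGUAGE_ANALYZERS[lang_code]}
                }
            }
        }
    }


def route(doc, fallback="en"):
    """Pick the target index for a document.

    Assumes `doc["lang"]` was filled in before indexing; unknown or
    missing languages go to the fallback index.
    """
    lang = doc.get("lang", fallback)
    if lang not in LANGUAGE_ANALYZERS:
        lang = fallback
    return index_name(lang)


if __name__ == "__main__":
    print(route({"lang": "de", "body": "Das Haus ist alt"}))
    print(index_settings("de"))
```

At query time you can then hit a single index if the query language is known, or an index pattern such as `docs-*` if hits in any language are acceptable.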

Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case). You will not get language-specific stemming, but you will at least get usable token streams for most languages.
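For the whitespace-only baseline, the settings could look roughly like the sketch below. The analyzer name `ws_lowercase` and the field name `body` are made up for this example, and the added `lowercase` filter is my own assumption, not something the approach requires. The small `approx_tokens` helper is only a local approximation of what such an analyzer emits, handy for eyeballing the behaviour without a cluster.

```python
def whitespace_index_body(field="body", analyzer_name="ws_lowercase"):
    """Index body with a custom analyzer: whitespace tokenizer + lowercase filter."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase"],  # optional, added for this sketch
                    }
                }
            }
        },
        "mappings": {
            "properties": {field: {"type": "text", "analyzer": analyzer_name}}
        },
    }


def approx_tokens(text):
    """Local approximation of the analyzer above: lowercase, split on whitespace."""
    return text.lower().split()


if __name__ == "__main__":
    print(approx_tokens("Das Haus ist alt"))
    print(approx_tokens("The houses are old"))
```

Note that "houses" stays "houses", so a query for "house" will not match it, and text in languages written without spaces between words ends up as one long token. Those are the cases to look at when you evaluate result quality.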