How to index mixed language contents on Elasticsearch?

elasticsearch indexing full-text-search search-engine multiple-languages

What are you trying to achieve? Should a query return hits only in the language used at query time, or would you also accept hits in any other language?

One approach would be to run all of Elasticsearch's different language analyzers on the input and store each result in a separate field, for instance one suffixed with the language of its analyzer. At query time, you would then have to search all of these fields unless you have a way to guess the most relevant ones.
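A minimal sketch of the field-per-analyzer idea, written as plain Python dictionaries that could be sent as JSON request bodies. The field name `content`, the three chosen languages, and the helper names are assumptions for illustration; `english`, `german` and `french` are built-in Elasticsearch language analyzers, and sub-fields are declared through the mapping's `fields` parameter.

```python
import json

# Assumed set of languages; each name is a built-in Elasticsearch language analyzer.
LANGUAGES = ["english", "german", "french"]


def build_mapping(field="content", languages=LANGUAGES):
    """Build an index mapping with one analyzed sub-field per language.

    The sub-fields end up as e.g. content.english, content.german, ...
    """
    sub_fields = {lang: {"type": "text", "analyzer": lang} for lang in languages}
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "standard",  # language-neutral main field
                    "fields": sub_fields,
                }
            }
        }
    }


def build_query(text, field="content", languages=LANGUAGES):
    """Build a multi_match query that searches the main field and every sub-field."""
    fields = [field] + [f"{field}.{lang}" for lang in languages]
    return {
        "query": {
            "multi_match": {"query": text, "fields": fields, "type": "most_fields"}
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("Haus"), indent=2))
```

If the query language can be guessed, the `fields` list can be narrowed to the matching sub-field instead of searching all of them.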

However, this is likely to blow up the index size, since you create a multitude of mostly unused duplicates of every field. It is IMHO also less elegant than having separate indices.

I would strongly recommend evaluating whether you really do not know the number of languages you will see in production. Having a distinct index per language would give you much more control over the input/output and enable you to fine-tune your engine to the actual use case.
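A minimal sketch of the index-per-language setup, assuming each document's language code is known or detected before indexing. The index prefix `docs`, the fallback index, the language table, and the helper names are hypothetical; the settings body sets the index's `default` analyzer to a built-in language analyzer.

```python
# Assumed mapping from language codes to built-in Elasticsearch language analyzers.
SUPPORTED = {"en": "english", "de": "german", "fr": "french"}


def index_for(lang_code, prefix="docs"):
    """Pick the target index name for a document's language code.

    Unknown languages go to a catch-all index (a hypothetical convention).
    """
    if lang_code in SUPPORTED:
        return f"{prefix}-{lang_code}"
    return f"{prefix}-unknown"


def index_settings(lang_code):
    """Build create-index settings that make the language analyzer the index default.

    Falls back to the standard analyzer for languages not in SUPPORTED.
    """
    analyzer = SUPPORTED.get(lang_code, "standard")
    return {
        "settings": {
            "analysis": {"analyzer": {"default": {"type": analyzer}}}
        }
    }


if __name__ == "__main__":
    for code in ("de", "xx"):
        print(index_for(code), index_settings(code))
```

At query time the same lookup decides which index to search, so a German query only touches the German index.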

Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case). You will not get language-specific stemming, but you will at least get token streams for most languages.
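A minimal sketch of such a baseline, assuming a custom analyzer named `ws_baseline` (a made-up name) built on the built-in `whitespace` tokenizer. The `lowercase` filter is an addition not mentioned above, included so that matching is case-insensitive. The small `approx_tokens` helper only approximates locally what the analyzer would emit, to make evaluation easier to reason about.

```python
def whitespace_settings(analyzer_name="ws_baseline"):
    """Build index settings for a custom analyzer using the whitespace tokenizer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase"],  # assumption: not part of the original advice
                    }
                }
            }
        }
    }


def approx_tokens(text):
    """Rough local approximation: split on whitespace, then lowercase.

    No stemming happens, so "houses" and "house" stay different tokens.
    """
    return [token.lower() for token in text.split()]


if __name__ == "__main__":
    print(whitespace_settings())
    print(approx_tokens("Das Haus ist gross"))
```

Note that scripts written without spaces between words, such as Chinese or Japanese, are not split into words by this approach, which is one thing the per-use-case evaluation should check.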
