Document Similarity in Elasticsearch

I think the Elasticsearch documentation can easily be misinterpreted.

Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on matching terms from the query.

The documentation states:

A similarity (scoring / ranking model) defines how matching documents are scored.

The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).

The documentation on term vectors can also be misinterpreted.

Here "term vectors" refers to per-document term statistics that can easily be queried. It seems that any similarity measurement across term vectors would then have to be done in your application, after the query. The documentation on term vectors states:

Returns information and statistics on terms in the fields of a particular document.
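
To illustrate what "in your application, after the query" could look like, here is a minimal sketch that computes cosine similarity between two documents from their term frequencies. It assumes each term vectors response has already been fetched and parsed into a dict shaped like `{"term_vectors": {field: {"terms": {term: {"term_freq": n}}}}}`; the helper names, the `body` field, and the sample data are hypothetical, and the responses below are hand-written stand-ins rather than real query results.

```python
import math


def term_freqs(termvectors_response, field):
    """Extract {term: term_freq} for one field of a parsed response dict.

    Assumes the response shape described above; adjust if yours differs.
    """
    terms = termvectors_response["term_vectors"][field]["terms"]
    return {term: stats["term_freq"] for term, stats in terms.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hand-written stand-ins for two parsed term vectors responses (hypothetical data).
doc_a = {"term_vectors": {"body": {"terms": {
    "quick": {"term_freq": 2},
    "fox": {"term_freq": 1},
}}}}
doc_b = {"term_vectors": {"body": {"terms": {
    "quick": {"term_freq": 1},
    "dog": {"term_freq": 1},
}}}}

score = cosine_similarity(term_freqs(doc_a, "body"), term_freqs(doc_b, "body"))
print(round(score, 4))
```

Raw term frequencies are used here only to keep the sketch short; a TF-IDF weighting built from the document frequencies that term vectors can also return would be a natural refinement.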

If you need a performant (fast) similarity metric over a very large corpus, you might consider computing a low-rank embedding of your documents and storing it in an index built for approximate nearest neighbor (ANN) search. The KNN lookup greatly reduces the candidate set, after which you can afford more costly metric calculations for the final ranking.
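
The two-stage idea can be sketched in plain Python as follows. This is an illustration under loud assumptions, not a production setup: a Gaussian random projection stands in for whatever low-rank embedding you choose, a brute-force scan over the embedded vectors stands in for a real ANN index, and every name and parameter (`random_projection`, `search`, `n_candidates`, the synthetic `docs`) is hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def random_projection(dim_in, dim_out, seed=0):
    """Build a dim_out x dim_in Gaussian matrix (stand-in for a learned embedding)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_out)]


def embed(vec, projection):
    """Project a full-dimensional vector into the low-dimensional space."""
    return [sum(w * x for w, x in zip(row, vec)) for row in projection]


def search(query, docs, projection, embedded_docs, n_candidates=10, top_k=3):
    """Return indices of the top_k documents for the query."""
    q_low = embed(query, projection)
    # Stage 1: cheap candidate lookup in the low-dimensional space.
    # A real system would query an ANN index here instead of scanning.
    candidates = sorted(
        range(len(docs)),
        key=lambda i: cosine(q_low, embedded_docs[i]),
        reverse=True,
    )[:n_candidates]
    # Stage 2: the costlier exact metric, computed only for the candidates.
    reranked = sorted(candidates, key=lambda i: cosine(query, docs[i]), reverse=True)
    return reranked[:top_k]


# Synthetic corpus: 200 documents as 50-dimensional vectors (made-up data).
rng = random.Random(42)
docs = [[rng.random() for _ in range(50)] for _ in range(200)]
projection = random_projection(dim_in=50, dim_out=8)
embedded_docs = [embed(d, projection) for d in docs]  # computed once, at index time

results = search(docs[7], docs, projection, embedded_docs)
print(results)
```

The trade-off sits in `n_candidates`: a larger candidate set makes it less likely that the cheap first stage drops a true neighbor, at the price of more exact comparisons in the second stage.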

Here is an excellent resource for evaluating approximate KNN solutions: https://github.com/erikbern/ann-benchmarks
