Simple explanation of different ElasticSearch similarity algorithms


The problem you run into here is that, by the description given in the linked answer, Lucene's default similarity and bm25 are fundamentally identical, in that both factor in:

  • term frequency: documents with more occurrences of the term are preferred
  • rarity: terms that are rarer in the corpus are preferred
  • document length: matches in shorter documents are weighted more heavily
  • other adjustments to the score, such as boosts
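To make the overlap concrete, here is a minimal sketch of both formulas for a single term. It is a simplification, not Lucene's actual code: it leaves out boosts, query normalization and coordination factors, and the function and parameter names are made up for illustration. The defaults k1=1.2 and b=0.75 are the usual BM25 defaults.

```python
import math


def classic_tfidf(tf, doc_len, doc_freq, n_docs):
    """Simplified per-term score in the style of Lucene's classic TF-IDF."""
    term_freq = math.sqrt(tf)                        # more occurrences -> higher score
    idf = 1.0 + math.log(n_docs / (doc_freq + 1.0))  # rarer in the corpus -> higher score
    length_norm = 1.0 / math.sqrt(doc_len)           # shorter document -> higher score
    return term_freq * idf * idf * length_norm


def bm25(tf, doc_len, doc_freq, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Simplified per-term BM25 score."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length is measured relative to the average document length in the corpus.
    length_factor = 1.0 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_factor)


if __name__ == "__main__":
    # Hypothetical corpus: 1000 documents, average length 100 terms.
    for tf in (1, 3, 10):
        print(tf, classic_tfidf(tf, 100, 10, 1000), bm25(tf, 100, 10, 1000, 100))
```

Both functions rise with term frequency, rise as the term gets rarer, and fall as the document gets longer. What differs is the shape of each curve, most visibly that BM25's term-frequency contribution levels off while the square root keeps growing.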

dfr alone encompasses 7 different basic models, each using a different scoring algorithm, followed by two highly configurable normalization steps. Many of its configurations fit the very general description above; some diverge from it.

Similarly, ib allows significant configuration, but it generally hits the same high points: favoring higher term frequency, favoring matches on rarer terms (by some definition of rare), and adjusting for document length, boosts, and other possible normalizations.
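To give a feel for how much there is to configure, here is a sketch of index settings declaring one dfr and one ib similarity, built as a Python dict and printed as JSON. The similarity names `my_dfr` and `my_ib` are hypothetical, and the option values are just one plausible combination; which values are accepted (in particular for `basic_model`) depends on your Elasticsearch version, so check the documentation for the release you run.

```python
import json

# DFR: pick a basic model, then two further steps that adjust the raw score.
dfr_similarity = {
    "type": "DFR",
    "basic_model": "g",           # one of the basic models
    "after_effect": "l",          # first normalization step
    "normalization": "h2",        # second step: term-frequency / length normalization
    "normalization.h2.c": "3.0",  # tuning parameter for the h2 normalization
}

# IB: pick a distribution and a lambda, plus the same kind of normalization.
ib_similarity = {
    "type": "IB",
    "distribution": "ll",
    "lambda": "df",
    "normalization": "h2",
    "normalization.h2.c": "3.0",
}

# "my_dfr" and "my_ib" are made-up names that a field mapping could then refer to.
index_settings = {
    "settings": {
        "index": {
            "similarity": {
                "my_dfr": dfr_similarity,
                "my_ib": ib_similarity,
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

The printed JSON is the kind of body you would send when creating the index; a field then opts in by naming one of these similarities in its mapping.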