How does the edge ngram token filter differ from the ngram token filter?


I think the documentation is pretty clear on this:

This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for nGram tokenizer again comes from the documentation:

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

    "type" : "nGram",
    "min_gram" : "2",
    "max_gram" : "3",
    "token_chars": [ "letter", "digit" ]

In short:

  • the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
  • nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is, all of them will create chunks).
  • edgeNGram does the same but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.

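For the same text, an edgeNGram tokenizer with min_gram 2 and max_gram 5 generates this: FC, Sc, Sch, Scha, Schal, 04. (With max_gram 3, as in the definition above, each list of chunks would stop at three characters: FC, Sc, Sch, 04.) Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).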
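The difference can be sketched in a few lines of JavaScript. This is only an illustration of the chunking rule, not Elasticsearch's implementation: the helper names (splitTokens, ngrams, edgeNgrams) are made up for this example, and the regular expression is an approximation of token_chars set to letter and digit. The edge example uses a maximum length of 5 so that it reproduces the longer Scha and Schal chunks quoted above.

```javascript
// Approximates token_chars: ["letter", "digit"] by keeping runs of
// Unicode letters and decimal digits (an assumption, not ES code).
function splitTokens(text) {
  return text.match(/[\p{L}\p{Nd}]+/gu) || [];
}

// nGram: every position in the token anchors chunks of length min..max.
function ngrams(token, minGram, maxGram) {
  const out = [];
  for (let start = 0; start < token.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      out.push(token.slice(start, start + len));
    }
  }
  return out;
}

// edgeNGram: only position 0 anchors chunks.
function edgeNgrams(token, minGram, maxGram) {
  const out = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    out.push(token.slice(0, len));
  }
  return out;
}

const tokens = splitTokens("FC Schalke 04"); // FC, Schalke, 04
const ngramResult = tokens.flatMap((t) => ngrams(t, 2, 3));
const edgeResult = tokens.flatMap((t) => edgeNgrams(t, 2, 5));

console.log(ngramResult.join(", ")); // FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
console.log(edgeResult.join(", "));  // FC, Sc, Sch, Scha, Schal, 04
```

The only difference between the two helpers is the outer loop over start positions: ngrams has it, edgeNgrams does not.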


ngram moves a cursor through the text as it breaks it up:

    Text: Red Wine
    Options:
        min_gram: 2
        max_gram: 3
    Result: Re, Red, ed, Wi, Win, in, ine, ne

As you can see, at each cursor position it emits every fragment from min_gram to max_gram characters long, then moves the cursor one character forward and repeats.


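edge_ngram does exactly the same thing as ngram, but it doesn't move the cursor:

    Text: Red Wine
    Options:
        min_gram: 2
        max_gram: 3
    Result: Re, Red

Why didn't it return Win? Because the cursor doesn't move: every fragment starts at position zero, and only its length grows from min_gram to max_gram. Note that this result assumes the whole text is a single token, which is the case when the tokenizer has no token_chars configured. With token_chars set to letter and digit, Wine would be its own token and would also yield Wi and Win.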
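Whether Win shows up depends on what counts as a token. As far as I know, an edge_ngram tokenizer without token_chars keeps all characters and so treats the whole text as one token. The sketch below contrasts that with splitting on letters and digits first. It is only a model of the behaviour: edgeNgrams is a hypothetical helper, and the regular expression is a stand-in for token_chars set to letter and digit.

```javascript
// Hypothetical helper: prefixes of one token with lengths minGram..maxGram.
function edgeNgrams(token, minGram, maxGram) {
  const out = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    out.push(token.slice(0, len));
  }
  return out;
}

const text = "Red Wine";

// No token_chars: the whole text is a single token, so only position 0 counts.
const wholeText = edgeNgrams(text, 2, 3);

// token_chars letter + digit (approximated by a regex): each word is a token.
const words = text.match(/[\p{L}\p{Nd}]+/gu) || [];
const perWord = words.flatMap((w) => edgeNgrams(w, 2, 3));

console.log(wholeText.join(", ")); // Re, Red
console.log(perWord.join(", "));   // Re, Red, Wi, Win
```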


Think of edge_ngram as a substring function in a programming language such as JavaScript:

    // ngram
    let str = "Red Wine";
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red
    console.log(str.substring(1, 3)); // ed, start from position 1
    // ...

    // edge_ngram
    // notice that the position is always zero
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red

Try it out by yourself using Kibana:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 3,
              "token_chars": [
                "letter",
                "digit"
              ]
            },
            "my_edge_ngram_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 3
            }
          }
        }
      }
    }

    POST my_index/_analyze
    {
      "tokenizer": "my_ngram_tokenizer",
      "text": "Red Wine"
    }

    POST my_index/_analyze
    {
      "tokenizer": "my_edge_ngram_tokenizer",
      "text": "Red Wine"
    }