How does the edge ngram token filter differ from the ngram token filter?

I think the documentation is pretty clear on this:

This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.

And the best example for nGram tokenizer again comes from the documentation:

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
    # FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

With this tokenizer definition:

    "type" : "nGram",
    "min_gram" : "2",
    "max_gram" : "3",
    "token_chars": [ "letter", "digit" ]

In short:

The tokenizer first splits the text into tokens according to its configuration. With token_chars set to letter and digit, this example yields FC, Schalke, 04.
nGram then generates every group of characters between min_gram and max_gram long from each token. A chunk can start at any character of the token, so every position produces its own chunks.
edgeNGram does the same, but every chunk starts at the first character of its token. The chunks are anchored at the beginning of the tokens.

For the same text and the same settings as above (min_gram 2, max_gram 3), edgeNGram generates FC, Sc, Sch, 04. Raising max_gram to 5 gives FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
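The difference between the two can be sketched in plain JavaScript. This only illustrates the idea and is not how Elasticsearch implements it: tokenize, ngrams and edgeNgrams are hypothetical helper names, and the regular expression merely approximates token_chars set to letter and digit.

```javascript
// Hypothetical helper: split text into runs of letters and digits,
// approximating token_chars: ["letter", "digit"].
function tokenize(text) {
  return text.match(/[\p{L}\p{Nd}]+/gu) || [];
}

// nGram: substrings of length minGram..maxGram starting at EVERY position.
function ngrams(token, minGram, maxGram) {
  const out = [];
  for (let start = 0; start < token.length; start++) {
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      out.push(token.slice(start, start + len));
    }
  }
  return out;
}

// edgeNGram: substrings of length minGram..maxGram starting at position 0 only.
function edgeNgrams(token, minGram, maxGram) {
  const out = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    out.push(token.slice(0, len));
  }
  return out;
}

const tokens = tokenize("FC Schalke 04");

console.log(tokens.flatMap(t => ngrams(t, 2, 3)).join(", "));
// FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04

console.log(tokens.flatMap(t => edgeNgrams(t, 2, 3)).join(", "));
// FC, Sc, Sch, 04
```

The only difference between the two helpers is the outer loop over start positions, which is exactly what "anchored at the beginning of the token" removes.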


ngram moves the cursor while breaking the text:

    Text: Red Wine
    Options:
        min_gram: 2
        max_gram: 3
    Result: Re, Red, ed, Wi, Win, in, ine, ne

At each cursor position within a word, it emits the fragments of length min_gram up to max_gram that still fit, then the cursor moves one character forward and the process repeats.

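edge_ngram does the same thing as ngram, but it doesn't move the cursor:

    Text: Red Wine
    Options:
        min_gram: 2
        max_gram: 3
    Result: Re, Red

Why didn't it return Win? Because the cursor doesn't move: every fragment starts at position zero, and only its length grows from min_gram to max_gram. Note that the edge_ngram tokenizer in the configuration below has no token_chars, so the whole text "Red Wine" is treated as a single token. With token_chars set to letter and digit, each word would get its own prefixes (Re, Red, Wi, Win).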

Think of edge_ngram as a substring function, such as the one in JavaScript, whose start index is always zero:

    // ngram
    let str = "Red Wine";
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red
    console.log(str.substring(1, 3)); // ed, start from position 1
    // ...

    // edge_ngram
    // notice that the position is always zero
    console.log(str.substring(0, 2)); // Re
    console.log(str.substring(0, 3)); // Red
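To see why Win is missing from the edge result, the substring analogy can be wrapped in a small helper and applied once to the whole text and once per word. This is a sketch under assumptions: edgePrefixes is a hypothetical name, and splitting on whitespace only approximates what a tokenizer with token_chars set to letter and digit would do for this particular input.

```javascript
// Hypothetical helper: every prefix of `text` with length minGram..maxGram.
function edgePrefixes(text, minGram, maxGram) {
  const out = [];
  for (let len = minGram; len <= Math.min(maxGram, text.length); len++) {
    out.push(text.substring(0, len)); // start index is always 0
  }
  return out;
}

const text = "Red Wine";

// No token_chars: the whole text is one token, so only its prefixes appear.
const wholeText = edgePrefixes(text, 2, 3);
console.log(wholeText.join(", ")); // Re, Red

// Splitting into words first (assumption: stands in for token_chars
// ["letter", "digit"]) gives each word its own prefixes.
const perWord = text.split(/\s+/).flatMap(word => edgePrefixes(word, 2, 3));
console.log(perWord.join(", ")); // Re, Red, Wi, Win
```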

Try it out by yourself using Kibana:

    PUT my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 3,
              "token_chars": [
                "letter",
                "digit"
              ]
            },
            "my_edge_ngram_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 3
            }
          }
        }
      }
    }

    POST my_index/_analyze
    {
      "tokenizer": "my_ngram_tokenizer",
      "text": "Red Wine"
    }

    POST my_index/_analyze
    {
      "tokenizer": "my_edge_ngram_tokenizer",
      "text": "Red Wine"
    }
