add elision filter to snowball


Since an analyzer is simply the combination of a tokenizer and zero or more token filters, you can build your own custom snowball analyzer that mimics the "defaults" and adds your own filters on top, such as an elision token filter.

As stated in the snowball analyzer documentation:

An analyzer of type snowball that uses the standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter.
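
For reference, just those defaults rebuilt as a custom analyzer would look roughly like this (a sketch only, the analyzer name rebuilt_snowball is mine; the full example with the extra filters follows right below):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_snowball": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "snowball"
          ]
        }
      }
    }
  }
}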

So here is an example that contains both implementations: a snowball analyzer with the default filters plus custom stopwords and elision, and a language analyzer with a custom list of stopwords:

{  "settings": {    "analysis": {      "analyzer": {       "custom_snowball_analyzer": {          "tokenizer": "standard",          "filter": [            "standard",            "lowercase",            "stop",            "snowball",            "custom_stop",            "custom_elision"          ]        },        "custom_language_analyzer": {          "type": "french",          "stopwords": ["a", "à", "t"]        }      },      "filter": {        "custom_stop": {          "type": "stop",          "stopwords": ["a", "à", "t"]        },        "custom_elision": {          "type": "elision",          "articles": ["l", "m", "t", "qu", "n", "s", "j"]        }      }    }  }}

Let's see the tokens produced by both analyzers, using the same testing sentence:

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_snowball_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token

  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amour",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_language_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token

  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amou",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

As you can see, both analyzers produce almost exactly the same tokens, except for "amour", which has not been stemmed by the custom snowball analyzer. To be honest, I don't know why, since the snowball filter uses a stemmer under the hood.

About your second question: those filters only take effect at indexing time (during the tokenization step). I would say both implementations will perform almost equally (the language analyzer should be slightly faster since it only stems French words in this example), and the difference won't be noticeable unless you plan to index huge documents under heavy load.

Search response times should be similar because the tokens are almost the same (if you index French documents only), so I think Lucene will deliver the same performance.
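
For example, a plain match query hitting a field analyzed by either analyzer looks exactly the same from the client's side (the field name content is just for illustration, it is not part of your mapping):

# same query regardless of which analyzer is configured on the field
curl -XGET 'http://localhost:9200/testing/_search?pretty' -d '{
  "query": {
    "match": {
      "content": "dans le pré"
    }
  }
}'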

To conclude, I would choose the language analyzer if you are indexing French documents only, since it is far smaller in the mapping definition :-)
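
For what it's worth, wiring either analyzer into the mapping is a one-liner per field; something like this, where the type article and the field content are just placeholders for your own mapping:

# attach the analyzer to a string field (old-style string mapping)
curl -XPUT 'http://localhost:9200/testing/article/_mapping' -d '{
  "article": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "custom_language_analyzer"
      }
    }
  }
}'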