add elision filter to snowball


Since an analyzer is simply the combination of a tokenizer and zero or more token filters, you can build your own custom snowball analyzer that mimics the "defaults" and adds your own filters on top, such as an elision token filter.

As stated in the snowball analyzer documentation:

An analyzer of type snowball that uses the standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter.
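
For reference, just those defaults rebuilt as a custom analyzer would look roughly like this (a sketch only, the analyzer name rebuilt_snowball is mine; the full example with the extra filters follows right below):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_snowball": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "snowball"
          ]
        }
      }
    }
  }
}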

So here is an example that contains both implementations: a snowball analyzer with the default filters plus custom stopwords and elision, and a language analyzer with a custom list of stopwords:

{  "settings": {    "analysis": {      "analyzer": {       "custom_snowball_analyzer": {          "tokenizer": "standard",          "filter": [            "standard",            "lowercase",            "stop",            "snowball",            "custom_stop",            "custom_elision"          ]        },        "custom_language_analyzer": {          "type": "french",          "stopwords": ["a", "à", "t"]        }      },      "filter": {        "custom_stop": {          "type": "stop",          "stopwords": ["a", "à", "t"]        },        "custom_elision": {          "type": "elision",          "articles": ["l", "m", "t", "qu", "n", "s", "j"]        }      }    }  }}

Let's see the tokens produced by both analyzers, using the same testing sentence:

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_snowball_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token

  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amour",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

curl -sXGET 'http://localhost:9200/testing/_analyze?analyzer=custom_language_analyzer&pretty' -d "Il a de la chance, parait-t-il que l'amour est dans le pré, mais au final à quoi bon ?." | grep token

  "tokens" : [ {
    "token" : "il",
    "token" : "de",
    "token" : "la",
    "token" : "chanc",
    "token" : "parait",
    "token" : "il",
    "token" : "que",
    "token" : "amou",
    "token" : "est",
    "token" : "dan",
    "token" : "le",
    "token" : "pré",
    "token" : "mai",
    "token" : "au",
    "token" : "final",
    "token" : "quoi",
    "token" : "bon",

As you can see, both analyzers produce almost exactly the same tokens, except for "amour", which has not been stemmed by the custom snowball analyzer. To be honest, I don't know why, since the snowball filter uses a stemmer under the hood.

About your second question: those filters only take effect at indexing time (during the tokenization step). I would say both implementations will perform almost equally (the language analyzer should be slightly faster since it only stems French words in this example), and the difference won't be noticeable unless you plan to index huge documents under heavy load.

Search response times should be similar because the tokens are almost the same (if you index French documents only), so I think Lucene will deliver the same performance.
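
For example, a plain match query hitting a field analyzed by either analyzer looks exactly the same from the client's side (the field name content is just for illustration, it is not part of your mapping):

# same query regardless of which analyzer is configured on the field
curl -XGET 'http://localhost:9200/testing/_search?pretty' -d '{
  "query": {
    "match": {
      "content": "dans le pré"
    }
  }
}'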

To conclude, I would choose the language analyzer if you are indexing French documents only, since it is far smaller in the mapping definition :-)
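
For what it's worth, wiring either analyzer into the mapping is a one-liner per field; something like this, where the type article and the field content are just placeholders for your own mapping:

# attach the analyzer to a string field (old-style string mapping)
curl -XPUT 'http://localhost:9200/testing/article/_mapping' -d '{
  "article": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "custom_language_analyzer"
      }
    }
  }
}'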