Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working

html elasticsearch filter full-text-search mapping

You are mistaking the _source field in the response for the content that is analyzed and indexed. Your expectation seems to be that _source returns the analyzed document. This is incorrect.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.
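To make the distinction concrete, here is a toy model in Python. It is not how Elasticsearch is implemented: the class ToyIndex, the function analyze and the regex used in place of html_strip are all hypothetical stand-ins. It only illustrates that the original body is stored untouched while the analyzed tokens are what gets searched.

```python
import re

# Crude stand-in for the html_strip char filter (assumption, not the real filter).
TAG_RE = re.compile(r"<[^>]+>")


def analyze(text):
    """Toy analysis chain: strip tags, then emit lowercase word tokens."""
    stripped = TAG_RE.sub(" ", text)
    return [tok.lower() for tok in re.findall(r"\w+", stripped)]


class ToyIndex:
    def __init__(self):
        self.sources = {}   # doc id -> original JSON body, stored as passed in
        self.inverted = {}  # token -> set of doc ids, built from analyzed text

    def index(self, doc_id, doc):
        self.sources[doc_id] = doc  # kept verbatim, like _source
        for token in analyze(doc["body"]):
            self.inverted.setdefault(token, set()).add(doc_id)

    def search(self, term):
        ids = sorted(self.inverted.get(term.lower(), ()))
        return [{"_id": i, "_source": self.sources[i]} for i in ids]


original = {"body": "Title <p> Some <span> inline </span></p>"}
idx = ToyIndex()
idx.index("1", original)
hits = idx.search("inline")

print(hits[0]["_source"]["body"])  # the original text, tags included
print(sorted(idx.inverted))        # only the analyzed tokens
```

The search matches on the stripped tokens, yet the hit still carries the original HTML, which is the behaviour described in the quote above.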

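Ideally, when you want to format the source data for presentation purposes, as in this case, it should be done on the client side.

That said, one way to achieve it for this use case is to combine script fields with the keyword tokenizer. The keyword tokenizer emits the whole field as a single term after html_strip has run, so a script field that reads body.parsed returns the stripped text: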

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}

PUT test/test/1
{
   "body" : "Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
   "query" : {
      "match_all" : { }
   },
   "script_fields": {
      "terms" : {
         "script": "doc[field].values",
         "params": {
            "field": "body.parsed"
         }
      }
   }
}

Result:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
      "terms": [
         "Title \n Some déjà vu  website   this is inline \n "
      ]
   }
}

Note: I believe the above is a bad idea, since stripping the HTML tags can easily be done on the client side, where you have much more control over formatting than with a workaround such as this. More importantly, it may be more performant to do it on the client side.
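As a rough illustration of the client-side approach, the sketch below strips tags with Python's standard html.parser module before display. The names TagStripper and strip_tags are made up for this example, and collapsing whitespace is just one possible formatting choice.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never appended.
        self.parts.append(data)


def strip_tags(html_text):
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    # Collapse the runs of whitespace left behind where the tags used to be.
    return " ".join("".join(parser.parts).split())


body = ("Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> "
        "<span> this is inline </span></p> ")
print(strip_tags(body))
```

Applied to the _source of each hit, this leaves the index mapping untouched and keeps the formatting decisions in the client. Note that this simple version also keeps the text content of script and style elements, so it would need extending for full HTML pages.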
