Elasticsearch : Strip HTML tags before indexing docs with html_strip filter not working


You are confusing the "_source" field returned in the response with what is actually analyzed and indexed. Your expectation seems to be that the _source field in the response returns the analyzed document. This is incorrect.

From the documentation:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

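Ideally, in a case like this, where you want to format the source data for presentation purposes, the formatting should be done on the client side.

That said, one way to achieve it for this use case is to combine script fields with a keyword tokenizer. The keyword tokenizer emits the whole field value as a single token, so the sub-field "body.parsed" holds the complete HTML-stripped text as one term, which a script field can then return: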

PUT test
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_html_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "char_filter": [
                  "html_strip"
               ]
            },
            "parsed_analyzer": {
               "type": "custom",
               "tokenizer": "keyword",
               "char_filter": [
                  "html_strip"
               ]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "body": {
               "type": "string",
               "analyzer": "my_html_analyzer",
               "fields": {
                  "parsed": {
                     "type": "string",
                     "analyzer": "parsed_analyzer"
                  }
               }
            }
         }
      }
   }
}

PUT test/test/1
{
   "body": "Title <p> Some déjà vu <a href='http://somedomain.com'> website </a> <span> this is inline </span></p> "
}

GET test/_search
{
   "query": {
      "match_all": {}
   },
   "script_fields": {
      "terms": {
         "script": "doc[field].values",
         "params": {
            "field": "body.parsed"
         }
      }
   }
}

Result:

{
   "_index": "test",
   "_type": "test",
   "_id": "1",
   "_score": 1,
   "fields": {
      "terms": [
         "Title \n Some déjà vu  website   this is inline \n "
      ]
   }
}
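
If you do go with this workaround, the client still has to pull the script field out of each hit. Here is a minimal sketch of that step in Python, assuming the search response has already been parsed into a dict by whatever client you use. The helper name stripped_terms is hypothetical, and the "hits" / "hits" wrapper is the usual search response envelope, which the result shown above omits.

```python
def stripped_terms(response):
    """Map each hit's _id to the values of its 'terms' script field."""
    results = {}
    for hit in response.get("hits", {}).get("hits", []):
        # Script fields come back under "fields", not under "_source".
        results[hit["_id"]] = hit.get("fields", {}).get("terms", [])
    return results


# Sample response built from the single hit shown in the result above.
response = {
    "hits": {
        "hits": [
            {
                "_index": "test",
                "_type": "test",
                "_id": "1",
                "_score": 1,
                "fields": {
                    "terms": [
                        "Title \n Some déjà vu  website   this is inline \n "
                    ]
                },
            }
        ]
    }
}

print(stripped_terms(response))
```

Note that the returned text still carries the newlines and extra spaces that html_strip left in place of the tags, so some client-side cleanup is needed either way.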

Note: I believe the above is a bad idea, since stripping the HTML tags can easily be done on the client side, where you have much more control over formatting than with a workaround such as this. More importantly, it may also be more performant to do it on the client side.
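
For comparison, the client-side approach could look roughly like the following. This is only a sketch using Python's standard-library html.parser. The names _TextExtractor and strip_html are made up for illustration, and collapsing runs of whitespace into single spaces is my own formatting choice, not something Elasticsearch does.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of a document and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.text().split())


# The _source of the document indexed above, as a client would receive it.
source = {
    "body": "Title <p> Some déjà vu <a href='http://somedomain.com'> "
            "website </a> <span> this is inline </span></p> "
}
print(strip_html(source["body"]))
```

Since this runs on the untouched _source, you can change the formatting rules at any time without reindexing anything.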