Getting distinct values using NEST ElasticSearch client


You are correct that what you want is a terms aggregation. The problem you're running into is that ES analyzes the "BrandName" field by default, so the aggregation returns the individual tokens (e.g. "20th", "century", "fox") instead of the full value. This is the expected default behavior of a string field in ES.

What I recommend is changing BrandName into a multi-field. This lets you search on the individual parts and also run a terms aggregation on the "not analyzed" version, which keeps the full term (e.g. "20th Century Fox").
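To make the difference concrete, here is a minimal Python sketch that imitates what a terms aggregation counts on an analyzed versus a not-analyzed field. The `analyze` function is only a rough stand-in for the default analyzer (lowercase, split on non-alphanumeric characters), not the real ES implementation, and the function names and sample brand names are assumptions for illustration.

```python
import re
from collections import Counter


def analyze(text):
    # Rough stand-in for the default analyzer: lowercase the value,
    # then split it on anything that is not a letter or digit.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def terms_counts(values, analyzed):
    # A terms aggregation counts documents per term, so each term
    # is counted at most once per document (hence the set).
    counts = Counter()
    for value in values:
        terms = set(analyze(value)) if analyzed else {value}
        counts.update(terms)
    return dict(counts)


# Hypothetical documents, one brand name per document.
brands = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]

analyzed_counts = terms_counts(brands, analyzed=True)
raw_counts = terms_counts(brands, analyzed=False)

print(analyzed_counts)
print(raw_counts)
```

Only the not-analyzed counts have the original brand names as keys, which is what you need for a list of distinct values.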

Here is the documentation from ES.

https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html

[UPDATE] If you are using ES version 1.4 or newer, the syntax for multi-fields is a little different.

https://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html#_multi_fields
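As a rough side-by-side of the two syntaxes, the sketch below builds both mapping bodies as Python dicts. The legacy form is written from my recollection of the 0.90 `multi_field` type, so check it against the first link before relying on it. The function names and the `raw` sub-field name are assumptions.

```python
import json


def legacy_multi_field_mapping(field):
    # 0.90-style: a dedicated "multi_field" type whose default sub-field
    # has the same name as the field itself (as I recall the 0.90 docs).
    return {
        field: {
            "type": "multi_field",
            "fields": {
                field: {"type": "string", "index": "analyzed"},
                "raw": {"type": "string", "index": "not_analyzed"},
            },
        }
    }


def multi_field_mapping(field):
    # Newer style: a normal string field with extra sub-fields under "fields".
    return {
        field: {
            "type": "string",
            "fields": {
                "raw": {"type": "string", "index": "not_analyzed"},
            },
        }
    }


legacy = {"properties": legacy_multi_field_mapping("brandName")}
current = {"properties": multi_field_mapping("brandName")}

print(json.dumps(legacy, indent=2))
print(json.dumps(current, indent=2))
```

In both cases the not-analyzed version would be addressed as "brandName.raw" in an aggregation.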

Here is a full working sample that illustrates the point in ES 1.4.4. Note the mapping specifies a "not_analyzed" version of the field.

PUT hilden1

PUT hilden1/type1/_mapping
{
  "properties": {
    "brandName": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

POST hilden1/type1
{
  "brandName": "foo"
}

POST hilden1/type1
{
  "brandName": "bar"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "20th Century Fox"
}

POST hilden1/type1
{
  "brandName": "foo bar"
}

GET hilden1/type1/_search
{
  "size": 0,
  "aggs": {
    "analyzed_field": {
      "terms": {
        "field": "brandName",
        "size": 10
      }
    },
    "non_analyzed_field": {
      "terms": {
        "field": "brandName.raw",
        "size": 10
      }
    }
  }
}

Results of the last query:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "non_analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th Century Fox", "doc_count": 2 },
        { "key": "bar", "doc_count": 1 },
        { "key": "foo", "doc_count": 1 },
        { "key": "foo bar", "doc_count": 1 }
      ]
    },
    "analyzed_field": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        { "key": "20th", "doc_count": 2 },
        { "key": "bar", "doc_count": 2 },
        { "key": "century", "doc_count": 2 },
        { "key": "foo", "doc_count": 2 },
        { "key": "fox", "doc_count": 2 }
      ]
    }
  }
}

Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up.
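Once the response is parsed, the distinct values are simply the bucket keys of the not-analyzed aggregation. The following is a small Python sketch, assuming the response has already been decoded into a dict shaped like the result above (trimmed here); `distinct_values` and `has_more_values` are hypothetical helper names, not part of any client library.

```python
def distinct_values(response, agg_name):
    # Collect the bucket keys of one terms aggregation, in response order.
    agg = response.get("aggregations", {}).get(agg_name, {})
    return [bucket["key"] for bucket in agg.get("buckets", [])]


def has_more_values(response, agg_name):
    # A non-zero sum_other_doc_count means some documents fell outside
    # the returned buckets, so "size" was too small to list every value.
    agg = response.get("aggregations", {}).get(agg_name, {})
    return agg.get("sum_other_doc_count", 0) > 0


# Trimmed copy of the response shown above.
response = {
    "aggregations": {
        "non_analyzed_field": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "20th Century Fox", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "foo", "doc_count": 1},
                {"key": "foo bar", "doc_count": 1},
            ],
        }
    }
}

brand_names = distinct_values(response, "non_analyzed_field")
truncated = has_more_values(response, "non_analyzed_field")
print(brand_names, truncated)
```

If the second helper reports more values, raising "size" in the terms aggregation is the usual way to get the rest.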


I had a similar issue. I was displaying search results and wanted to show counts for each category and subcategory.

You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. "20th Century Fox" being split). This happens because the fields are analysed. I fixed it by adding the following mapping, which tells ES not to analyse those fields:

"category": {
  "type": "nested",
  "properties": {
    "CategoryNameAndSlug": {
      "type": "string",
      "index": "not_analyzed"
    },
    "SubCategoryNameAndSlug": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

As jhilden suggested, if you use the field for more than one purpose (e.g. search and aggregation), you can set it up as a multi-field. One version is analysed and used for searching, while the other is left unanalysed and used for aggregation.
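One thing to keep in mind with the mapping above: because "category" is mapped as a nested type, a terms aggregation on its sub-fields generally has to be wrapped in a nested aggregation that names the path. Here is a hedged Python sketch of such a request body. The aggregation names ("by_path", "values") and the helper function are made up for illustration.

```python
import json


def nested_terms_request(path, field, size=10):
    # Build a search body that returns no hits, only a terms aggregation
    # on path.field, wrapped in a nested aggregation for that path.
    return {
        "size": 0,
        "aggs": {
            "by_path": {
                "nested": {"path": path},
                "aggs": {
                    "values": {
                        "terms": {"field": path + "." + field, "size": size}
                    }
                },
            }
        },
    }


body = nested_terms_request("category", "CategoryNameAndSlug")
print(json.dumps(body, indent=2))
```

The same helper could be called with "SubCategoryNameAndSlug" to get the subcategory counts.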