Getting distinct values using NEST ElasticSearch client
You are correct that what you want is a terms aggregation. The problem you're running into is that ES analyzes the "BrandName" field at index time, so the aggregation returns the individual tokens ("20th", "century", "fox") instead of the whole value. This is the expected default behavior for a string field in ES.
What I recommend is that you change BrandName into a "Multifield". This will allow you to search on all the individual parts, and also to run a terms aggregation on the "Not Analyzed" version (i.e. the full "20th Century Fox" term).
Here is the documentation from ES.
https://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
[UPDATE] If you are using ES version 1.4 or newer, the syntax for multi-fields is a little different.
Here is a full working sample that illustrates the point in ES 1.4.4. Note that the mapping specifies a "not_analyzed" version of the field.
    PUT hilden1

    PUT hilden1/type1/_mapping
    {
      "properties": {
        "brandName": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }

    POST hilden1/type1
    { "brandName": "foo" }

    POST hilden1/type1
    { "brandName": "bar" }

    POST hilden1/type1
    { "brandName": "20th Century Fox" }

    POST hilden1/type1
    { "brandName": "20th Century Fox" }

    POST hilden1/type1
    { "brandName": "foo bar" }

    GET hilden1/type1/_search
    {
      "size": 0,
      "aggs": {
        "analyzed_field": {
          "terms": { "field": "brandName", "size": 10 }
        },
        "non_analyzed_field": {
          "terms": { "field": "brandName.raw", "size": 10 }
        }
      }
    }
Results of the last query:
    {
      "took": 3,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": { "total": 5, "max_score": 0, "hits": [] },
      "aggregations": {
        "non_analyzed_field": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": "20th Century Fox", "doc_count": 2 },
            { "key": "bar", "doc_count": 1 },
            { "key": "foo", "doc_count": 1 },
            { "key": "foo bar", "doc_count": 1 }
          ]
        },
        "analyzed_field": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": "20th", "doc_count": 2 },
            { "key": "bar", "doc_count": 2 },
            { "key": "century", "doc_count": 2 },
            { "key": "foo", "doc_count": 2 },
            { "key": "fox", "doc_count": 2 }
          ]
        }
      }
    }
Notice that the not-analyzed field keeps "20th Century Fox" and "foo bar" together, whereas the analyzed field breaks them up into lowercased tokens.
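To make the difference in bucket counts concrete, here is a small Python sketch that mimics the two aggregations on the same five documents. It is only an illustration: `analyzed_terms` is a hypothetical stand-in that lowercases and splits on whitespace, which is much cruder than the real ES standard analyzer, and `terms_buckets` is an invented helper, not part of any ES client.

```python
from collections import Counter

# The five brandName values indexed in the sample above.
docs = ["foo", "bar", "20th Century Fox", "20th Century Fox", "foo bar"]

def analyzed_terms(value):
    # Crude approximation of an analyzed field: lowercase, split on whitespace.
    # A set is used because doc_count counts documents, not occurrences.
    return set(value.lower().split())

def raw_terms(value):
    # A not_analyzed field indexes the whole value as a single term.
    return {value}

def terms_buckets(values, tokenize):
    # Count how many documents contain each term, then order the buckets
    # by doc_count descending and key ascending.
    counts = Counter()
    for value in values:
        counts.update(tokenize(value))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

analyzed = terms_buckets(docs, analyzed_terms)
raw = terms_buckets(docs, raw_terms)

print(analyzed)
print(raw)
```

The analyzed version yields five token buckets with a count of 2 each, while the raw version keeps "20th Century Fox" as one bucket with a count of 2, matching the results shown above.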
I had a similar issue. I was displaying search results and wanted to show counts for each category and subcategory.
You're right to use aggregations. I also had the issue with the strings being tokenised (i.e. "20th Century Fox" being split), which happens because the fields are analysed. In my case, I added the following mapping, which tells ES not to analyse those fields:
    "category": {
      "type": "nested",
      "properties": {
        "CategoryNameAndSlug": {
          "type": "string",
          "index": "not_analyzed"
        },
        "SubCategoryNameAndSlug": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
As jhilden suggested, if you use this field for more than one purpose (e.g. search and aggregation), you can set it up as a multifield. That way the analysed version is used for searching and the not-analysed version is used for aggregation.
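As a rough sketch of what that could look like, the Python snippet below builds a multifield variant of my mapping and prints it as JSON. It assumes the ES 1.x string multi-field syntax from jhilden's sample; the `raw` sub-field name and the `as_multi_field` helper are my own choices for illustration, not anything ES requires.

```python
import json

def as_multi_field(sub_field="raw"):
    # Analysed main field for searching, plus a not_analyzed sub-field
    # for aggregations (ES 1.x "string" mapping syntax assumed).
    return {
        "type": "string",
        "fields": {
            sub_field: {"type": "string", "index": "not_analyzed"}
        },
    }

# Same nested structure as the mapping above, with each field made a multifield.
category_mapping = {
    "category": {
        "type": "nested",
        "properties": {
            "CategoryNameAndSlug": as_multi_field(),
            "SubCategoryNameAndSlug": as_multi_field(),
        },
    }
}

print(json.dumps(category_mapping, indent=2))
```

With a mapping like this, searches would presumably go against `category.CategoryNameAndSlug`, and terms aggregations against `category.CategoryNameAndSlug.raw`.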