Elasticsearch SQL like subquery aggregation

As I understand it, there's no way to refer to an aggregation result in a filter of the same query, so you can only solve part of the puzzle with a single query:

GET /purchases/fruits/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "BoughtDate": {
            "gte": "2015-01-01",  // assuming you have the right mapping for dates
            "lte": "2016-03-01"
          }
        }
      }
    }
  },
  "sort": { "BoughtDate": { "order": "desc" } },
  "aggs": {
    "byBoughtDate": {
      "terms": {
        "field": "BoughtDate",
        "order": { "_term": "desc" }
      },
      "aggs": {
        "distinctCount": {
          "cardinality": {
            "field": "Fruit"
          }
        }
      }
    }
  }
}

This gives you all the documents within the date range, plus aggregated bucket counts sorted by term, so the max date is on top. The client can parse the first bucket (both key and count) and then fetch the documents for that date value. For the distinct fruit count you just use a nested cardinality aggregation.
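For illustration, the first (newest) bucket the client would parse might look roughly like this; the numbers here are made up, not actual output:

{
  "byBoughtDate": {
    "buckets": [
      {
        "key": 1456790400000,
        "key_as_string": "2016-03-01T00:00:00.000Z",
        "doc_count": 3,
        "distinctCount": {
          "value": 2
        }
      }
    ]
  }
}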

Yep, the query returns much more info than you need, but that's life :)


Naturally there's no direct route from SQL to the Elasticsearch DSL, but there are some pretty common correlations.

For starters, any GROUP BY / HAVING is going to come down to an aggregation. The normal query semantics can generally be covered (and more so) by the Query DSL.
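For example, a simple SQL GROUP BY such as SELECT BoughtInStore, COUNT(*) FROM grocery GROUP BY BoughtInStore corresponds loosely to a terms aggregation (the field names here just follow the example data further down):

{
  "size": 0,
  "aggs": {
    "group_by_store": {
      "terms": {
        "field": "BoughtInStore"
      }
    }
  }
}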

How to prepare a subset of data before aggregation (like in this example, the latest row in the range for each Fruit)

So, you're kind of asking for two different things.

How to prepare a subset of data before aggregation

This is the query phase.

(like in this example, the latest row in the range for each Fruit)

You're technically asking it to aggregate to get the answer to this example; it's not a normal query. In your example, you use MAX, which is in effect a GROUP BY.
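As a sketch of that idea, the latest BoughtDate per fruit could be fetched with a max metric nested under a terms grouping (the aggregation names are mine; the fields match the example data below):

{
  "size": 0,
  "aggs": {
    "group_by_fruit": {
      "terms": {
        "field": "Fruit"
      },
      "aggs": {
        "latest_purchase": {
          "max": {
            "field": "BoughtDate"
          }
        }
      }
    }
  }
}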

How to group results by multiple fields

It depends: do you want them tiered (generally, yes) or do you want them combined?

If you want them tiered, then you just use sub-aggregations to get what you want. If you want them combined, then you generally just use a filters aggregation for the different groupings, as sketched below.
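For the combined case, a filters aggregation might look roughly like this; the bucket names and criteria are purely illustrative:

{
  "size": 0,
  "aggs": {
    "group_by_store_and_person": {
      "filters": {
        "filters": {
          "jungle_john": {
            "bool": {
              "filter": [
                { "term": { "BoughtInStore": "Jungle" } },
                { "term": { "BiteBy": "John" } }
              ]
            }
          },
          "shop_mark": {
            "bool": {
              "filter": [
                { "term": { "BoughtInStore": "Shop" } },
                { "term": { "BiteBy": "Mark" } }
              ]
            }
          }
        }
      }
    }
  }
}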

Putting it together: you want the most recent purchase, per fruit, given a certain filtered date range. The date ranges are just normal queries / filters:

{  "query": {    "bool": {      "filter": [        {          "range": {            "BoughtDate": {              "gte": "2016-01-01",              "lte": "2016-01-31"            }          }        },        {          "range": {            "BestBeforeDate": {              "gte": "2016-01-01",              "lte": "2016-01-31"            }          }        }      ]    }  }}

With that, only documents whose dates fall within those ranges for both fields will be included (effectively an AND). Because I used a filter, the query is unscored and cacheable.

Now, you need to start aggregating to get the rest of the information. Let's start by assuming the documents have been filtered using the above filter in order to simplify what we're looking at. We'll combine it at the end.

{  "size": 0,  "aggs": {    "group_by_date": {      "date_histogram": {        "field": "BoughtDate",        "interval": "day",        "min_doc_count": 1      },      "aggs": {        "group_by_store": {          "terms": {            "field": "BoughtInStore"          },          "aggs": {            "group_by_person": {              "terms": {                "field": "BiteBy"              }            }          }        }      }    }  }}

You want "size": 0 at the top level because you don't actually care about hits. You only want aggregated results.

Your first aggregation was actually grouping by the most recent date. I changed it a little to make it more realistic (each day), but it's effectively the same. Given the way you used MAX, we could have used a terms aggregation with "size": 1 (sketched below), but this is truer to how you'd want to do it when a date (and presumably time!) is involved. I also asked it to ignore days that have no matching documents (since the histogram runs from the first day to the last, we don't actually care about the empty days in between).
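For reference, that terms-based alternative might look like this: sort the terms descending and keep only the top one (this relies on the "_term" ordering available in ES 1.x / 2.x):

{
  "size": 0,
  "aggs": {
    "latest_bought_date": {
      "terms": {
        "field": "BoughtDate",
        "size": 1,
        "order": { "_term": "desc" }
      }
    }
  }
}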

If you really only wanted the last day, then you could use a pipeline aggregation to drop everything except the max bucket, but a realistic usage of this type of request would want the full date range.
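One possible sketch of that pipeline approach, assuming ES 2.0+ where the max_bucket sibling aggregation is available: nest a max metric inside each day's bucket, then point max_bucket at it so the bucket with the latest date is surfaced:

{
  "size": 0,
  "aggs": {
    "group_by_date": {
      "date_histogram": {
        "field": "BoughtDate",
        "interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "max_date": {
          "max": { "field": "BoughtDate" }
        }
      }
    },
    "latest_day": {
      "max_bucket": {
        "buckets_path": "group_by_date>max_date"
      }
    }
  }
}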

We then continue by grouping by store, which is what you want, and sub-group by person (BiteBy). Every bucket carries a doc_count, so this gives you the count implicitly.

Putting it all back together:

{  "size": 0,  "query": {    "bool": {      "filter": [        {          "range": {            "BoughtDate": {              "gte": "2016-01-01",              "lte": "2016-01-31"            }          }        },        {          "range": {            "BestBeforeDate": {              "gte": "2016-01-01",              "lte": "2016-01-31"            }          }        }      ]    }  },  "aggs": {    "group_by_date": {      "date_histogram": {        "field": "BoughtDate",        "interval": "day",        "min_doc_count": 1      },      "aggs": {        "group_by_store": {          "terms": {            "field": "BoughtInStore"          },          "aggs": {            "group_by_person": {              "terms": {                "field": "BiteBy"              }            }          }        }      }    }  }}

Note: Here's how I indexed the data.

PUT /grocery/store/_bulk
{"index":{"_id":"1"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-01","BestBeforeDate":"2016-01-02","BiteBy":"John"}
{"index":{"_id":"2"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-02","BestBeforeDate":"2016-01-04","BiteBy":"Mat"}
{"index":{"_id":"3"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-03","BestBeforeDate":"2016-01-05","BiteBy":"Mark"}
{"index":{"_id":"4"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-04","BestBeforeDate":"2016-02-01","BiteBy":"Simon"}
{"index":{"_id":"5"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-12","BestBeforeDate":"2016-01-12","BiteBy":"John"}
{"index":{"_id":"6"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-14","BestBeforeDate":"2016-01-16","BiteBy":"Mark"}
{"index":{"_id":"7"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-20","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
{"index":{"_id":"8"}}
{"Fruit":"Kiwi","BoughtInStore":"Shop","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Mark"}
{"index":{"_id":"9"}}
{"Fruit":"Kiwi","BoughtInStore":"Jungle","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}

It's critical that the string values you want to aggregate on (store and person) are not_analyzed strings (keyword in ES 5.0)! Otherwise aggregations will use what's called fielddata, which uninverts the field into JVM heap memory, and that's not a good thing.

The mappings would look like this in ES 1.x / ES 2.x:

PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BoughtInStore": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BiteBy": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BestBeforeDate": {
          "type": "date"
        },
        "BoughtDate": {
          "type": "date"
        }
      }
    }
  }
}
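For comparison, the equivalent ES 5.x mapping would use the keyword type instead of not_analyzed strings (a sketch of the same fields):

PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit": { "type": "keyword" },
        "BoughtInStore": { "type": "keyword" },
        "BiteBy": { "type": "keyword" },
        "BestBeforeDate": { "type": "date" },
        "BoughtDate": { "type": "date" }
      }
    }
  }
}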

Put all of this together, and you get the answer as:

{  "took": 8,  "timed_out": false,  "_shards": {    "total": 1,    "successful": 1,    "failed": 0  },  "hits": {    "total": 8,    "max_score": 0,    "hits": []  },  "aggregations": {    "group_by_date": {      "buckets": [        {          "key_as_string": "2016-01-01T00:00:00.000Z",          "key": 1451606400000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "John",                      "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-02T00:00:00.000Z",          "key": 1451692800000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Mat",                      "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-03T00:00:00.000Z",          "key": 1451779200000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Mark",                      "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-12T00:00:00.000Z",          "key": 1452556800000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "John",                      "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-14T00:00:00.000Z",          "key": 1452729600000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Mark",                   
   "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-20T00:00:00.000Z",          "key": 1453248000000,          "doc_count": 1,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Simon",                      "doc_count": 1                    }                  ]                }              }            ]          }        },        {          "key_as_string": "2016-01-21T00:00:00.000Z",          "key": 1453334400000,          "doc_count": 2,          "group_by_store": {            "doc_count_error_upper_bound": 0,            "sum_other_doc_count": 0,            "buckets": [              {                "key": "Jungle",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Simon",                      "doc_count": 1                    }                  ]                }              },              {                "key": "Shop",                "doc_count": 1,                "group_by_person": {                  "doc_count_error_upper_bound": 0,                  "sum_other_doc_count": 0,                  "buckets": [                    {                      "key": "Mark",                      "doc_count": 1                    }                  ]                }              }            ]          }        }      ]    }  }}