Analytics in Elasticsearch
Firstly I do not think that using ElasticSearch for OLAP-like queries is a good idea. I would recommend you use some Dremel-like technologies (Impala, TEZ, Storm etc.) which support sql you specified. It has some advantages like:
- you do not have to transfer data from Hadoop into ElasticSearch.
- you can use sql
- you do not have to care about parsing json from ElasticSearch query answers.
Do not get me wrong, I love ElasticSearch/Logstash/Kibana, but for log collecting and visualization. Of course it is possible to make some advanced queries but it has some limitations which I found out in my personal projects.
Also consider using Kibana, it is a great tool for data statistics in ElasticSearch and you can do a lot of things with it.
Here are some queries example as you requested (i did not tested it):
1)
{ "aggs": { "website": { "terms": { "field": "website" }, "aggs": { "sum_stats": { "sum": { "field": "stats" }, "aggs": { "distinct_user": { "cardinality": { "field": "user_id", "precision_threshold": 100 } } } } } } }}
2-6 are similar, use things from 1) with different filters like this:
{ "aggs": { "your_filter": { "filter": { "term": {"event_src_id" : "XXXXXXXXXXX"} } }, "aggs": { "website": { "terms": { "field": "website" }, "aggs": { "sum_stats": { "sum": { "field": "stats" }, "aggs": { "distinct_user": { "cardinality": { "field": "user_id", "precision_threshold": 100 } } } } } } } }}
7) DELETE is quite easy
"query" : { "term" : { "event_src_id" : "XXXXXXXXXXX" } }}
how to group and sum and distinct at same time
Aggregations can have sub-aggregations.
First, the group functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation.Second, There is a sum aggregation, a simple stats metric aggregation.Finally, Your use of distinct in this case is to perform count(distinct), which corresponds to a cardinality aggregation, which can be approximate or exact depending on your needs.
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)
There is a delete by query api which you can use, but be careful about high percentages of deleted documents; Lucene and Elasticsearch are not optimized for this, and you will incur an overhead due to delete markers in the data.
Examples
select website, sum(stats), count(distinct(user_id)) from table group by website
GET /_search{ "aggs": { "website_stats": { "terms": { "field": "website" }, "aggs": { "sum_stats": { "sum": { "field": "stats" } }, "count_users": { "cardinality": { "field": "user_id" } } } } }}
select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)
GET /_search{ "query": { "filter": { "range": { "Time": { "gt": "2009-01-01 00:00:00", "lt": "2010-01-01 00:00:00" } } } }, "aggs": { "monthly_stats" { "terms": { "field": "website" }, "aggs": { "months": { "date_histogram": { "field": "Time", "interval": "month" }, "aggs" : { "sum_stats": { "sum": { "field": "stats" } }, "count_users": { "cardinality": { "field": "user_id" } } } } } } }}