Analytics in Elasticsearch

hadoop elasticsearch aggregate olap-cube

Firstly I do not think that using ElasticSearch for OLAP-like queries is a good idea. I would recommend you use some Dremel-like technologies (Impala, TEZ, Storm etc.) which support sql you specified. It has some advantages like:

you do not have to transfer data from Hadoop into ElasticSearch.
you can use sql
you do not have to care about parsing json from ElasticSearch query answers.

Do not get me wrong, I love ElasticSearch/Logstash/Kibana, but for log collecting and visualization. Of course it is possible to make some advanced queries but it has some limitations which I found out in my personal projects.

Also consider using Kibana, it is a great tool for data statistics in ElasticSearch and you can do a lot of things with it.

Here are some queries example as you requested (i did not tested it):

{  "aggs": {    "website": {      "terms": {        "field": "website"      },      "aggs": {        "sum_stats": {          "sum": {            "field": "stats"          },          "aggs": {            "distinct_user": {              "cardinality": {                "field": "user_id",                "precision_threshold": 100              }            }          }                  }      }    }  }}

2-6 are similar, use things from 1) with different filters like this:

{  "aggs": {    "your_filter": {      "filter": {        "term": {"event_src_id" : "XXXXXXXXXXX"}      }    },    "aggs": {      "website": {        "terms": {          "field": "website"        },        "aggs": {          "sum_stats": {            "sum": {              "field": "stats"            },            "aggs": {              "distinct_user": {                "cardinality": {                  "field": "user_id",                  "precision_threshold": 100                }              }            }          }        }      }    }  }}

7) DELETE is quite easy

    "query" : {        "term" : { "event_src_id" : "XXXXXXXXXXX" }    }}

hadoop elasticsearch aggregate olap-cube

how to group and sum and distinct at same time

Aggregations can have sub-aggregations.

First, the group functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation.Second, There is a sum aggregation, a simple stats metric aggregation.Finally, Your use of distinct in this case is to perform count(distinct), which corresponds to a cardinality aggregation, which can be approximate or exact depending on your needs.

7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)

There is a delete by query api which you can use, but be careful about high percentages of deleted documents; Lucene and Elasticsearch are not optimized for this, and you will incur an overhead due to delete markers in the data.

Examples

select website, sum(stats), count(distinct(user_id)) from table group by website

GET /_search{   "aggs": {      "website_stats": {        "terms": {           "field": "website"        },        "aggs": {           "sum_stats": {             "sum": {               "field": "stats"             }           },           "count_users": {             "cardinality": {               "field": "user_id"            }          }        }      }   }}

select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)

GET /_search{   "query": {     "filter": {       "range": {           "Time": {              "gt": "2009-01-01 00:00:00",              "lt": "2010-01-01 00:00:00"           }       }     }   },   "aggs": {      "monthly_stats" {        "terms": {           "field": "website"        },        "aggs": {           "months": {              "date_histogram": {                "field": "Time",                "interval": "month"              },              "aggs" : {                "sum_stats": {                  "sum": {                    "field": "stats"                  }                },                "count_users": {                  "cardinality": {                    "field": "user_id"                }              }            }          }        }      }   }}

CodeHunter

Analytics in Elasticsearch

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last