realtime querying/aggregating millions of records - hadoop? hbase? cassandra?


Hive and Pig don't seem like they would help you. Each of them compiles down to one or more map/reduce jobs, so a response within 5 seconds is not possible.

HBase may work, although your infrastructure is a bit small for optimal performance. I don't see why you can't pre-compute summary statistics for each column. Look up computing running averages so that you don't have to do heavyweight reduces.

Check out http://en.wikipedia.org/wiki/Standard_deviation

stddev(X) = sqrt(E[X^2]- (E[X])^2)

This implies that you can get the stddev of the combined dataset AB by doing

sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, similarly, E[AB] = (sum(A) + sum(B)) / (|A| + |B|)
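To illustrate, here is a minimal sketch (function names are my own, not from any framework) of merging per-partition running totals - count, sum, and sum of squares - into the stddev of the combined dataset, so no partition's raw data has to be re-scanned:

```python
import math

def partial_stats(xs):
    """Per-partition running totals: count, sum, and sum of squares."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return n, s, sq

def combined_stddev(stats_a, stats_b):
    """stddev(AB) = sqrt(E[AB^2] - (E[AB])^2), computed from merged totals."""
    n = stats_a[0] + stats_b[0]
    s = stats_a[1] + stats_b[1]
    sq = stats_a[2] + stats_b[2]
    mean = s / n            # E[AB] = (sum(A) + sum(B)) / (|A| + |B|)
    second_moment = sq / n  # E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|)
    return math.sqrt(second_moment - mean ** 2)

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0]
# Matches the population stddev of the concatenated list a + b.
print(combined_stddev(partial_stats(a), partial_stats(b)))
```

The same merge generalizes to any number of partitions, which is why keeping (count, sum, sum of squares) per column is enough for incremental aggregation.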


Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.


This is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like Greenplum or Netezza should do. Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time... Regardless of the engine used, I think the solution should include holding the whole dataset in memory - that should give you an idea of what size of cluster you need.