Hadoop and analytics?

Twitter has open sourced Storm, which they call the Hadoop of realtime processing. Its use cases, stream processing and distributed RPC, meet the above-mentioned requirements. Note that Storm does not depend on Hadoop. Here is a presentation on Storm. Then there is HStreaming, which sits on top of Hadoop, as well as S4 and StreamBase.

Plain Hadoop suits batch processing and is not meant for real-time analytics. The above are some of the software options for real-time analytics. Some of them sit on top of Hadoop (like HStreaming) and some don't; some are free and some are commercial. With so many variants, a detailed requirements study, a comparison of the features each one supports, and finally a proof of concept will let you settle on one.


Indeed, the core of Hadoop is batch-oriented, which makes it better suited to periodic reporting than to realtime data analysis.

One option is to use a graphing and logging system dedicated to event processing. In that case, a tool like Graphite would be perfect for your needs. There is a post on the Etsy engineering blog that describes how it can be used.
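Feeding events into Graphite is simple: its Carbon backend accepts a plaintext protocol of one metric per line, in the form `<metric.path> <value> <unix_timestamp>`. A minimal sketch (the host, port, and metric names below are illustrative; port 2003 is Carbon's default plaintext listener):

```python
import socket
import time

def graphite_line(path, value, ts=None):
    """Format one metric in Graphite's plaintext protocol:
    '<metric.path> <value> <unix_timestamp>\n'."""
    ts = int(ts if ts is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(path, value, host="graphite.example.com", port=2003):
    # Hypothetical host; Carbon listens for plaintext metrics on 2003.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(graphite_line(path, value).encode())
```

Your trackers can then emit counters like `send_metric("web.hits.home", 1)` as events happen, and Graphite handles the storage and the realtime graphs.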

If you like Hadoop, you can use something built on top of it, such as OpenTSDB, which uses HBase.


It's true that Hadoop (well, MapReduce) is for batch processing. However, Hadoop is also a distributed file system. As realtime data enters your cluster, you can have worker nodes process it as it becomes available.

For example, if you want to update a dashboard every 5 minutes, you can set up a daemon that reads from HDFS all the newly added log files from the individual tracking servers and updates the storage your web app reads its data from.

At the end of the day, a MapReduce job does the same thing your daemon did, but this time over all the files of the day, using all the nodes in your cluster.
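The same counting expressed in map/reduce form, as a plain-Python sketch (in a real job the map and reduce functions run distributed across the cluster, one split per node; the log format matches the daemon example above and is illustrative):

```python
from collections import Counter
from functools import reduce

def map_file(lines):
    """Map step: emit one count per log line, keyed by URL."""
    return Counter(line.split()[0] for line in lines if line.strip())

def reduce_counts(a, b):
    """Reduce step: merge the partial counts from two mappers."""
    return a + b

def daily_report(files):
    # Each element of `files` would be one input split on one node.
    partials = [map_file(lines) for lines in files]
    return reduce(reduce_counts, partials, Counter())
```

The daemon and the nightly job produce the same numbers; the daemon just trades completeness for latency by working on whatever files have arrived so far.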