Apache Storm compared to Hadoop


What is your opinion on how the two compare?

Twitter Storm has been touted as real-time Hadoop. That is more of a marketing line for easy consumption.

They are superficially similar, since both are distributed application solutions. Apart from the typical distributed architectural elements, like master/slave roles and ZooKeeper-based coordination, to me the comparison falls off a cliff.

Twitter Storm is more like a pipeline for processing data as it arrives. The pipes connect the computing nodes that receive data, compute, and deliver output. (Their lingo is spouts and bolts.) Extend this analogy to a complex pipeline wiring that can be re-engineered when required and you get Twitter Storm.
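To make the spout-and-bolt picture concrete, here is a toy sketch in plain Python. It does not use Storm's API at all; the names `tweet_spout`, `split_bolt`, `count_bolt` and `run_topology` are made up for illustration, and generators stand in for the "pipes" between nodes. Assume a real topology would run these stages on separate machines.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def tweet_spout(tweets: Iterable[str]) -> Iterator[str]:
    """Spout (hypothetical): emits each tweet into the pipeline as it arrives."""
    for tweet in tweets:
        yield tweet


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt (hypothetical): splits each incoming tweet into lowercase words."""
    for tweet in stream:
        for word in tweet.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt (hypothetical): keeps a running count and emits it per word seen."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(tweets: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> split -> count; each item flows through as soon as it is emitted."""
    return list(count_bolt(split_bolt(tweet_spout(tweets))))
```

The point of the sketch is that every word gets an updated count the moment it passes through, without waiting for the whole input to be collected first.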

In a nutshell, it processes data as it arrives, so latency is very low.

Hadoop, however, is different in this respect, primarily because of HDFS. It is a solution geared to distributed storage and to tolerating outages at many scales (disks, machines, racks, etc.).

M/R is built to leverage data locality on HDFS to distribute computational jobs. Together, they do not provide a facility for real-time data processing. But that is not always a requirement when you are searching through large data sets (the needle-in-a-haystack analogy).
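For contrast, a batch job in the M/R style might look roughly like the following toy sketch. Again this is plain Python and not Hadoop's API; `map_phase`, `shuffle`, `reduce_phase` and `run_batch_job` are hypothetical names, and each inner list stands in for a block of data already stored on some node.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(block: List[str]) -> List[Tuple[str, int]]:
    """Map (hypothetical): runs where the block lives, emits (word, 1) pairs."""
    return [(word, 1) for line in block for word in line.lower().split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle (hypothetical): groups all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce (hypothetical): sums the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_batch_job(blocks: List[List[str]]) -> Dict[str, int]:
    """Process every stored block, then produce one result at the end."""
    pairs: List[Tuple[str, int]] = []
    for block in blocks:
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))
```

Here nothing comes out until the whole stored data set has been mapped and reduced, which is fine for the haystack search but not for reacting to data as it arrives.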

In short, Twitter Storm is a distributed real-time data processing solution. I don't think we should compare them. Twitter uses it because it needs a facility to process tweets that are individually small but enormous in number, and to do so in real time.

See HStreaming if you feel compelled to compare it with something.


Basically, both of them are used for analyzing big data, but Storm is used for real-time processing while Hadoop is used for batch processing.

This is a very good introduction to Storm that I found: Click here


Rather than being compared, they are meant to complement each other, giving you batch plus real-time (or pseudo-real-time) processing. There is a corresponding video presentation: Ted Dunning on Twitter's Storm.