Hadoop, Mahout real-time processing alternative


You are right, Hadoop is designed for batch-type processing.

Reading the question, I thought of the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

(from: InfoQ post)

However, I have not worked with it yet, so I really cannot say much about it in practice.

Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm


Given that you want a real-time response in the "seconds" range, I recommend something like this:

  1. Set up a batch processing model for precomputing as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.

  2. Use a real-time system to do the last few things that cannot be precomputed. For this, look at either the mentioned S4 or the recently announced Twitter Storm.
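
To make the split between the two steps concrete, here is a minimal sketch in plain Python. It assumes the batch job produces simple per-item counts and that events arriving since the last batch run are tallied in memory. The class name `HybridCounter` and its methods are made up for illustration and are not part of Hadoop, Mahout, S4, or Storm.

```python
from collections import Counter


class HybridCounter:
    """Answers queries from a batch-computed view plus events seen since that batch."""

    def __init__(self, batch_view):
        # batch_view: item -> count, e.g. the output of a periodic batch job (assumption).
        self.batch_view = dict(batch_view)
        # Counts for events that arrived after the batch view was computed.
        self.recent = Counter()

    def record(self, item):
        """Real-time path: tally one new event in memory."""
        self.recent[item] += 1

    def load_batch(self, batch_view):
        """Swap in a fresh batch result.

        Assumes the new batch already covers every event recorded so far,
        so the in-memory tally can be reset.
        """
        self.batch_view = dict(batch_view)
        self.recent.clear()

    def count(self, item):
        """Query path: precomputed value plus the 'last second' delta."""
        return self.batch_view.get(item, 0) + self.recent[item]


counter = HybridCounter({"page_a": 1000, "page_b": 250})
counter.record("page_b")
counter.record("page_c")
print(counter.count("page_b"))  # 251
```

In a real setup the `record` calls would come from whatever stream processor you pick, and `load_batch` would run whenever a new batch result is published.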

Sometimes it pays to go really simple and store the precomputed values all in memory and simply do the last aggregation/filter/sorting/... steps in memory. If you can do that you can really scale because each node can run completely independently of all others.
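
As a rough illustration of the "keep it all in memory" approach, the sketch below holds precomputed scores in a plain dictionary and does only the final filter and sort at request time. The data, the `PRECOMPUTED` name, and the `recommend` function are hypothetical; in practice the dictionary would be loaded from the batch output at startup.

```python
import heapq

# Hypothetical precomputed output: user -> {item: score}, held entirely in memory.
PRECOMPUTED = {
    "alice": {"book": 0.9, "lamp": 0.7, "mug": 0.4, "pen": 0.2},
}


def recommend(user, just_viewed, n=2, precomputed=PRECOMPUTED):
    """Return the top-n precomputed items, skipping ones the user just viewed."""
    scores = precomputed.get(user, {})
    # Last-second filter: drop items that depend on the most recent activity.
    candidates = (
        (item, score) for item, score in scores.items() if item not in just_viewed
    )
    # Last-second sort: keep only the n best-scoring candidates.
    top = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [item for item, _ in top]


print(recommend("alice", just_viewed={"book"}))  # ['lamp', 'mug']
```

Because each request touches only local memory, every node holding a copy of the precomputed data can answer independently of the others.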

Perhaps having a NoSQL backend for your real-time component helps. There are lots of those available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...

It all depends on your real application.


Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively, though.