MapReduce and SQL GROUP BY


What you gain from MR is read speed. GROUP BY is a slow operation in SQL, and MR in MongoDB is even slower to run. The point is that MR writes its results to new collections, and you can then query those collections in real time. This works well when you have large amounts of data and need fast reads over the aggregated results.

In the project I'm working on there is a Python script running in the background (as a cron job) that runs several map/reduce jobs once per day. Instead of repeatedly iterating over large tables with SQL GROUP BY, we iterate once with MR and then query the newly created collections quickly.
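To illustrate the "iterate once, then read fast" pattern described above, here is a minimal pure-Python sketch. It does not talk to MongoDB; the `orders` list, the field names, and the `build_daily_summary` function are all hypothetical stand-ins for a large collection and the nightly job.

```python
from collections import defaultdict


def build_daily_summary(orders):
    """Make one full pass over the raw data, as a nightly job would."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for order in orders:
        row = summary[order["customer"]]
        row["count"] += 1
        row["total"] += order["amount"]
    return dict(summary)


# Hypothetical raw data standing in for a large collection.
orders = [
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "alice", "amount": 2.5},
]

# Run once (for example from cron); later reads are cheap lookups
# in the small precomputed summary instead of scans of the raw data.
daily_summary = build_daily_summary(orders)
print(daily_summary["alice"])
```

In the real setup the summary would be a MongoDB collection written by the MR job rather than an in-memory dict, but the trade-off is the same: a slow batch step buys fast reads.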

I have no experience with Hadoop, so I'm sorry I can't fill you in there.

Tutorial: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/

EDIT:

Here you can see a complete translation of an SQL query to a MongoDB Map/Reduce:

[Image: GROUP BY to MongoDB Map/Reduce]

It's taken from: http://rickosborne.org/download/SQL-to-MongoDB.pdf
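For readers who can't open the PDF, the general shape of such a translation can be sketched in plain Python. This is not the example from the linked document, only an illustration of the idea; the table, field names, and helper functions are hypothetical. It mirrors a query like `SELECT customer, COUNT(*), SUM(amount) FROM orders GROUP BY customer`.

```python
from collections import defaultdict


def map_order(order):
    """Map step: emit (key, value) pairs, like emit() in a MongoDB map function."""
    yield order["customer"], {"count": 1, "total": order["amount"]}


def reduce_orders(key, values):
    """Reduce step: fold all values for one key into one value of the same shape."""
    result = {"count": 0, "total": 0.0}
    for value in values:
        result["count"] += value["count"]
        result["total"] += value["total"]
    return result


def map_reduce(documents, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical rows standing in for the SQL table / MongoDB collection.
rows = [
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 5.0},
    {"customer": "alice", "amount": 2.5},
]

result = map_reduce(rows, map_order, reduce_orders)
print(result)
```

The GROUP BY column becomes the emitted key, and each aggregate (COUNT, SUM) becomes a field that the reduce step accumulates. The reduce output has the same shape as the emitted values, which matters in MongoDB because reduce may be called again on its own output.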


A lot of folks use MongoDB for data storage and Hadoop for processing, as there's a connector between the two. Each MongoDB node can handle multiple Hadoop nodes reading from it. As a note, I'd recommend separating the mongo and Hadoop nodes for memory reasons.

In case you don't have them, here are some documents for you.

One other thing that might be worth looking at is the new aggregation framework coming out in MongoDB 2.2. Here's a chart equating the operations in SQL with those in the MongoDB aggregation framework.
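As a rough illustration of that mapping, here is a small sketch that builds an aggregation pipeline as plain Python data. The `group_by_pipeline` helper and the field names are hypothetical; the stages used (`$group`, `$match`, `$sort`) and the `$sum` accumulator are part of the aggregation framework. The pipeline roughly corresponds to `SELECT customer, COUNT(*), SUM(amount) FROM orders GROUP BY customer HAVING SUM(amount) >= 10 ORDER BY customer`.

```python
def group_by_pipeline(group_field, sum_field, min_total=None):
    """Build a pipeline equivalent to a simple GROUP BY with optional HAVING."""
    pipeline = [
        # GROUP BY <group_field> with COUNT(*) and SUM(<sum_field>)
        {
            "$group": {
                "_id": "$" + group_field,
                "count": {"$sum": 1},
                "total": {"$sum": "$" + sum_field},
            }
        }
    ]
    if min_total is not None:
        # HAVING SUM(<sum_field>) >= min_total ($match placed after $group)
        pipeline.append({"$match": {"total": {"$gte": min_total}}})
    # ORDER BY the grouping key, ascending
    pipeline.append({"$sort": {"_id": 1}})
    return pipeline


pipeline = group_by_pipeline("customer", "amount", min_total=10)
print(pipeline)
```

With the PyMongo driver, a list like this would typically be passed to `collection.aggregate(pipeline)` on whatever collection holds the data; the collection and its fields here are assumptions for the sake of the example.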