MapReduce with MongoDB really, really slow (30 hours vs 20 minutes in MySQL for an equivalent database)


I have actually answered a very similar question before. The limitations of MapReduce in MongoDB have been outlined previously: as you mentioned, it is single-threaded, and the data has to be converted to JavaScript (SpiderMonkey) objects and back, etc.

That is why there are other options:

  1. The MongoDB Hadoop Connector (officially supported)
  2. The Aggregation Framework (Requires 2.1+)

As of this writing, the 2.2.0 stable release is not yet out, but it is up to RC2, so the release should be imminent. I would recommend giving the Aggregation Framework a shot as a more meaningful comparison for this type of testing.


Apparently, using the $group operator of the Aggregation Framework works well! :-)

The following JavaScript code gets the 10 most visited domains, with their visit counts, in 17m17s!

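    db.NonFTP_Access_log.aggregate(
        // Count one visit per document, grouped by domain
        { $group: {
            _id: "$domain",
            visits: { $sum: 1 }
        }},
        // Most visited domains first
        { $sort: { visits: -1 } },
        // Keep only the top 10
        { $limit: 10 }
    ).result.forEach(printjson);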
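To make it concrete what the pipeline above computes, here is a minimal plain-JavaScript sketch of the same three stages over an in-memory array. The sample documents and the helper name `topDomains` are made up for illustration; this is not how MongoDB implements the stages internally.

```javascript
// Hypothetical sample documents standing in for NonFTP_Access_log.
const accessLog = [
  { domain: "example.com" },
  { domain: "example.org" },
  { domain: "example.com" },
  { domain: "example.net" },
  { domain: "example.com" },
  { domain: "example.org" },
];

// Illustrative helper: plain-JavaScript equivalent of $group + $sort + $limit.
function topDomains(docs, limit) {
  const counts = new Map();
  for (const doc of docs) {
    // $group with visits: { $sum: 1 } adds one per document.
    counts.set(doc.domain, (counts.get(doc.domain) || 0) + 1);
  }
  return Array.from(counts, ([domain, visits]) => ({ _id: domain, visits }))
    .sort((a, b) => b.visits - a.visits) // $sort: { visits: -1 }
    .slice(0, limit);                    // $limit
}

const top = topDomains(accessLog, 2);
top.forEach((row) => console.log(JSON.stringify(row)));
```

With the sample data, `example.com` comes first with 3 visits, followed by `example.org` with 2.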

Anyway, I still don't understand why the MapReduce alternative is so slow. I have opened the following question in the MongoDB JIRA.
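For comparison, the MapReduce formulation of the same count probably looks something like the sketch below. The original map and reduce functions are not shown in this thread, so `mapFn`, `reduceFn`, and the tiny in-memory driver `runMapReduce` are all assumptions. In the mongo shell `emit` is a global, whereas here it is passed in as an argument so the sketch runs standalone. The point it illustrates is that the map function is a JavaScript call made once per document.

```javascript
// Assumed shape of the map function: emit one visit per document.
// `this` is the current document, as in MongoDB's map functions.
function mapFn(emit) {
  emit(this.domain, 1);
}

// Assumed shape of the reduce function: sum the emitted values for one key.
function reduceFn(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Hypothetical in-memory stand-in for the server-side MapReduce driver.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  // One JavaScript invocation per document.
  for (const doc of docs) map.call(doc, emit);
  return Array.from(groups, ([key, values]) => ({
    _id: key,
    value: reduce(key, values),
  }));
}

// Made-up sample documents.
const sampleDocs = [
  { domain: "example.com" },
  { domain: "example.org" },
  { domain: "example.com" },
];
const visitCounts = runMapReduce(sampleDocs, mapFn, reduceFn);
console.log(JSON.stringify(visitCounts));
```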


I think your results are quite normal, and I will try to justify them:

1. MySQL uses a binary format that is optimized for processing, while MongoDB works with JSON, so parsing time is added to the processing. I would estimate this at a factor of at least 10x.
2. JS is indeed much slower than C. I think a factor of at least 10 can be assumed. Together we get about 100x, similar to what you see: 20 minutes x 100 is 2000 minutes, or about 33 hours.
3. Hadoop is also not efficient for data processing, but it is able to use all the cores you have, and that makes a difference. Java also has a JIT that has been developed and optimized for more than 10 years.
4. I would suggest comparing not against MySQL but against TPC-H benchmark Q1, which is pure aggregation. I think systems like VectorWise will show the maximum possible throughput per core.
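The back-of-the-envelope estimate from points 1 and 2 can be checked with a few lines of arithmetic. The two 10x factors are the rough guesses given above, not measurements, and the variable names are only illustrative.

```javascript
// Rough estimate only: both factors are guesses from the answer, not measurements.
const mysqlMinutes = 20;   // observed MySQL run time
const parsingFactor = 10;  // assumed cost of parsing documents
const jsFactor = 10;       // assumed slowdown of JS relative to C

const estimatedMinutes = mysqlMinutes * parsingFactor * jsFactor;
const estimatedHours = estimatedMinutes / 60;

console.log(estimatedMinutes + " minutes, about " + estimatedHours.toFixed(1) + " hours");
```

This gives 2000 minutes, or roughly 33 hours, which is in the same range as the 30 hours reported in the question.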