MapReduce with MongoDB really, really slow (30 hours vs 20 minutes in MySQL for a equivalent database)

mongodb hadoop mapreduce

I've actually answered this very similar question before. The limitations of Map Reduce in MongoDB have been outlined previously - as you mentioned, it is single threaded, it has to be converted to Java Script (spidermonkey) and back etc.

That is why there are other options:

The MongoDB Hadoop Connector (officially supported)
The Aggregation Framework (Requires 2.1+)

As of this writing the 2.2.0 stable release was not yet out, but it was up to RC2, so the release should be imminent. I would recommend giving it a shot as a more meaningful comparison for this type of testing.

mongodb hadoop mapreduce

Apparently using the group function on Aggregation Framework works well! :-)

The following Javascript code gets the 10 most visited domains with their visits in 17m17s!

db.NonFTP_Access_log.aggregate(    { $group: {        _id: "$domain",        visits: { $sum: 1 }        }},    { $sort: { visits: -1 } },    { $limit: 10 }    ).result.forEach(printjson);

Anyway I still don't understand why the MapReduce alternative is so slow. I have opened the following question in the MongoDB JIRA.

mongodb hadoop mapreduce

I think your result quite normal and will try to justify them<.br>1. MySQL is using binary format which is optimized for the processing while MongoDB is working with JSON. So time of parsing is added to the processing. I would estimate it to factor 10x at least.
2. JS is indeed much slower then C. I think at least factor of 10 can be assumed. Together we get about x100 - similar to what you see. 20 minut x 1000 is 2000 minutes or about 33 hours.
3. Hadoop is also not efficient for data processing but it is capable to use all cores you have and it makes difference. Java also has JIT developed and optimized for more then 10 years.
4. I would suggest to look not on MySQL but on TPC-H benchmark Q1 - which is pure aggregation. I think systems like VectorWise will show maximum possible throughput per core.

CodeHunter

MapReduce with MongoDB really, really slow (30 hours vs 20 minutes in MySQL for a equivalent database)

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last