MongoDB: Terrible MapReduce Performance


Excerpts from MongoDB: The Definitive Guide (O'Reilly):

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

Options for map/reduce:

"keeptemp" : boolean
    If the temporary result collection should be saved when the connection is closed.

"output" : string
    Name for the output collection. Setting this option implies keeptemp : true.


Maybe I'm too late, but...

First, you are querying the collection to fill the MapReduce without an index. You should create an index on "day".

MongoDB MapReduce is single-threaded on a single server, but it parallelizes across shards. The data in each shard is kept together in contiguous chunks sorted by the sharding key.

Because your sharding key is "day" and you are querying on it, you are probably using only one of your three servers. The sharding key is only used to spread the data across servers. MapReduce will still query using the "day" index on each shard, and will be very fast.

Add something in front of the day key to spread the data. The username can be a good choice.

That way the MapReduce will be launched on all servers, hopefully reducing the time by a factor of three.
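
To see why the leading field of the shard key matters, here is a minimal sketch in plain JavaScript that does not talk to MongoDB at all. It assumes a toy range partitioner (the hypothetical `assignToShards`) that sorts documents by shard key and cuts the sorted run into equal contiguous chunks, one per shard; real chunk splitting and balancing are more involved. The usernames and days are made up.

```javascript
// Toy model of range-based sharding: sort by shard key, then cut the sorted
// run into contiguous chunks, one chunk per shard.
// Hypothetical helper for illustration only, not a MongoDB API.
function assignToShards(docs, keyFields, shardCount) {
  const keyOf = (doc) => keyFields.map((f) => String(doc[f])).join("|");
  const sorted = [...docs].sort((a, b) => {
    const ka = keyOf(a);
    const kb = keyOf(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
  const chunkSize = Math.ceil(sorted.length / shardCount);
  return sorted.map((doc, i) => ({ ...doc, shard: Math.floor(i / chunkSize) }));
}

// Which shards hold at least one document for the given day?
function shardsTouched(placedDocs, day) {
  return new Set(placedDocs.filter((d) => d.day === day).map((d) => d.shard));
}

// Made-up data: 6 users x 3 days = 18 documents.
const users = ["alice", "bob", "carol", "dave", "erin", "frank"];
const days = ["2010-07-14", "2010-07-15", "2010-07-16"];
const docs = [];
for (const username of users) {
  for (const day of days) {
    docs.push({ username, day, hits: 1 });
  }
}

const byDay = assignToShards(docs, ["day"], 3);
const byUserDay = assignToShards(docs, ["username", "day"], 3);

const touchedByDay = shardsTouched(byDay, "2010-07-16").size;
const touchedByUserDay = shardsTouched(byUserDay, "2010-07-16").size;

console.log("shard key {day}:", touchedByDay, "shard(s) hold that day");
console.log("shard key {username, day}:", touchedByUserDay, "shard(s) hold that day");
```

With this toy data, the {day} key puts the whole day on a single shard, while the {username, day} key leaves part of that day on each of the three shards, so a job filtered on one day has work to do everywhere.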

Something like this:

    use admin
    db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
    db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
    db.runCommand( { enablesharding : "profiles" } );
    db.runCommand( { shardcollection : "profiles.views", key : {username : 1, day: 1} } );
    use profiles
    db.views.ensureIndex({ hits: -1 });
    db.views.ensureIndex({ day: -1 });

I think with those additions, you can match MySQL's speed, or even beat it.

Also, it is better not to use it in real time. If your data doesn't need to be precise to the minute, schedule a MapReduce task every now and then and use the result collection.
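
The "run it in the background, read the results later" pattern can be sketched without a database. The snippet below is only an in-memory stand-in: `runMapReduce` is a hypothetical helper written for this example, `emit` is passed in as an argument rather than provided by the server, and the documents are made up. It does mirror one real detail: reduce is not called for a key that has only a single emitted value.

```javascript
// In-memory stand-in for a map/reduce job. `runMapReduce` is a hypothetical
// helper for this sketch; it is not part of MongoDB.
function runMapReduce(docs, map, reduce) {
  const grouped = new Map();
  // Collect emitted values per key.
  const emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };
  // As in MongoDB, the map function sees the current document as `this`.
  for (const doc of docs) map.call(doc, emit);

  const results = new Map();
  for (const [key, values] of grouped) {
    // A key with a single value is stored as-is, without calling reduce.
    results.set(key, values.length === 1 ? values[0] : reduce(key, values));
  }
  return results;
}

// Made-up sample documents.
const views = [
  { username: "alice", day: "2010-07-16", hits: 3 },
  { username: "alice", day: "2010-07-16", hits: 2 },
  { username: "bob", day: "2010-07-16", hits: 7 },
];

const mapHits = function (emit) { emit(this.username, this.hits); };
const reduceHits = (key, values) => values.reduce((a, b) => a + b, 0);

// The slow part: run this on a schedule, not once per request.
const hitsByUser = runMapReduce(views, mapHits, reduceHits);

// The fast part: requests only read the precomputed results.
console.log("alice:", hitsByUser.get("alice"));
console.log("bob:", hitsByUser.get("bob"));
```

The point of the split is that the expensive aggregation and the cheap lookup happen at different times; how stale the results may get is decided by how often the job is scheduled.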


You are not doing anything wrong. (Besides sorting on the wrong value as you already noticed in your comments.)

MongoDB map/reduce performance just isn't that great. This is a known issue; see for example http://jira.mongodb.org/browse/SERVER-1197 where a naive approach is ~350x faster than M/R.

One advantage, though, is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R has completed, the temporary collection is renamed to the permanent name atomically. That way you can schedule your statistics updates and query the M/R output collection in real time.
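
As a rough illustration of the out argument, here is what such a command document might look like. The collection name "views" and the fields "username", "hits" and "day" come from the answer above; the output collection name and the day value are assumptions made up for this sketch. The snippet only builds the object, so it runs without a server; in the mongo shell the object would be passed to db.runCommand.

```javascript
// Map and reduce functions in the shape MongoDB expects: map reads the current
// document from `this` and calls emit(), which the server provides at run time.
const mapFn = function () {
  emit(this.username, this.hits);
};

const reduceFn = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
};

// Command document for the mapReduce database command.
const command = {
  mapReduce: "views",            // source collection
  map: mapFn,
  reduce: reduceFn,
  query: { day: "2010-07-16" },  // hypothetical filter value
  out: "views_hits_by_user",     // hypothetical permanent output collection
};

// In the mongo shell (not executed here):
//   db.runCommand(command)
//   db.views_hits_by_user.find().sort({ value: -1 })
console.log(Object.keys(command).join(", "));
```

Each document in the output collection has the emitted key in `_id` and the reduced result in `value`, which is why the follow-up query sorts on `value`.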