MongoDB: Terrible MapReduce Performance

excerpts from MongoDB Definitive Guide from O'Reilly:

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

options for map/reduce:"keeptemp" : boolean If the temporary result collection should be saved when the connection is closed. "output" : string Name for the output collection. Setting this option implies keeptemp : true.

mongodb mapreduce nosql

Maybe I'm too late, but...

First, you are querying the collection to fill the MapReduce without an index. You shoud create an index on "day".

MongoDB MapReduce is single threaded on a single server, but parallelizes on shards. The data in mongo shards are kept together in contiguous chunks sorted by sharding key.

As your sharding key is "day", and you are querying on it, you probably are only using one of your three servers. Sharding key is only used to spread the data. Map Reduce will query using the "day" index on each shard, and will be very fast.

Add something in front of the day key to spread the data. The username can be a good choice.

That way the Map reduce will be launched on all servers and hopefully reducing the time by three.

Something like this:

use admindb.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );db.runCommand( { enablesharding : "profiles" } );db.runCommand( { shardcollection : "profiles.views", key : {username : 1,day: 1} } );use profilesdb.views.ensureIndex({ hits: -1 });db.views.ensureIndex({ day: -1 });

I think with those additions, you can match MySQL speed, even faster.

Also, better don't use it real time. If your data don't need to be "minutely" precise, shedule a map reduce task every now an then and use the result collection.

mongodb mapreduce nosql

You are not doing anything wrong. (Besides sorting on the wrong value as you already noticed in your comments.)

MongoDB map/reduce performance just isn't that great. This is a known issue; see for example http://jira.mongodb.org/browse/SERVER-1197 where a naive approach is ~350x faster than M/R.

One advantage though is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R is completed the temporary collection will be renamed to the permanent name atomically. That way you can schedule your statistics updates and query the M/R output collection real-time.

CodeHunter

MongoDB: Terrible MapReduce Performance

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last