Can someone explain map-reduce in C#?

Tags: mongodb

Can someone explain map-reduce in C#?


One way to understand Map-Reduce coming from C# and LINQ is to think of it as a SelectMany() followed by a GroupBy() followed by an Aggregate() operation.

In a SelectMany() you are projecting a sequence, but each element can become multiple elements. This is equivalent to using multiple emit statements in your map operation. The map operation can also choose not to call emit at all, which is like having a Where() clause inside your SelectMany() operation.

In a GroupBy() you are collecting elements with the same key which is what Map-Reduce does with the key value that you emit from the map operation.

In the Aggregate() or reduce step you are taking the collections associated with each group key and combining them in some way to produce one result for each key. Often this combination is simply adding up a single '1' value output with each key from the map step but sometimes it's more complicated.
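To make the three steps concrete, here is a minimal sketch of the same pipeline. It is written in JavaScript rather than C#, since that is the language MongoDB's map and reduce functions use; flatMap stands in for SelectMany(), a Map of arrays for GroupBy(), and Array.prototype.reduce for Aggregate(). The documents and the word-count task are made up for illustration.

```javascript
// Hypothetical input: a few documents with a free-text field.
const docs = [
  { text: "the quick fox" },
  { text: "the lazy dog" },
  { text: "" }, // produces no pairs, like a Where() filtering it out
];

// "SelectMany": each document becomes zero or more [key, value] pairs.
const emitted = docs.flatMap((doc) =>
  doc.text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => [word, 1])
);

// "GroupBy": collect the values that share a key.
const groups = new Map();
for (const [key, value] of emitted) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// "Aggregate": combine each group's values into one result per key.
const wordCounts = {};
for (const [key, values] of groups) {
  wordCounts[key] = values.reduce((sum, v) => sum + v, 0);
}

console.log(wordCounts);
```

Here "the" ends up with a count of 2 and every other word with 1, which is the "adding up a single 1 per key" case described above.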

One important caveat with MongoDB's map-reduce is that the reduce operation must accept and output the same data type because it may be applied repeatedly to partial sets of the grouped data. If you are passed an array of values, don't simply take the length of it because it might be a partial result from an earlier reduce operation.
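A small sketch of why that matters, using two hypothetical reduce functions and a hand-rolled two-pass reduction (this only simulates re-reduce locally; it does not talk to MongoDB):

```javascript
// Wrong: assumes every value is a raw 1 emitted by map.
function countByLength(key, values) {
  return values.length;
}

// Safer: sums the values, so a partial count from an earlier
// reduce pass is added in at its full weight.
function countBySum(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Five 1s were emitted for the same key.
const emittedOnes = [1, 1, 1, 1, 1];

// Simulate the engine reducing the first three values, then
// reducing that partial result together with the remaining two.
function reduceInTwoPasses(reduceFn) {
  const partial = reduceFn("k", emittedOnes.slice(0, 3));
  return reduceFn("k", [partial, ...emittedOnes.slice(3)]);
}

console.log(reduceInTwoPasses(countByLength)); // 3, which is wrong
console.log(reduceInTwoPasses(countBySum));    // 5, which is right
```

The length-based version treats the partial result 3 as if it were a single emitted value, so the final count comes out too low.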


Here's a spot to get started with Map Reduce in Mongo. The cookbook has a few examples; I would focus on these two.

I like to think of map-reduces in the context of "data warehousing jobs" or "rollups". You're basically taking detailed data and "rolling up" a smaller version of that data.

In SQL you would normally do this with sum() and avg() and group by. In MongoDB you would do this with a Map Reduce. The basic premise of a Map Reduce is that you have two functions.

The first function (map) is basically a giant for loop that runs over your data and "emits" certain keys and values. The second function (reduce) is a giant loop over all of the emitted data. The map says "hey, this is the data you want to summarize" and the reduce says "hey, this array of values reduces to this single value".
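As a rough illustration of those two functions, here is a sketch of a rollup similar to sum() and avg() with group by. The map and reduce functions follow the MongoDB convention (the document is bound to this, and map calls emit), but runMapReduce is a toy stand-in written for this example so it can run anywhere; it is not how the server works internally. The order documents and field names are made up.

```javascript
// Hypothetical order documents.
const orders = [
  { customer: "alice", total: 20 },
  { customer: "bob", total: 5 },
  { customer: "alice", total: 30 },
];

// Map: called once per document, with the document bound to `this`.
function mapFn() {
  emit(this.customer, { sum: this.total, count: 1 });
}

// Reduce: returns the same shape that map emits, so it can be re-applied.
function reduceFn(key, values) {
  return values.reduce(
    (acc, v) => ({ sum: acc.sum + v.sum, count: acc.count + v.count }),
    { sum: 0, count: 0 }
  );
}

// Toy engine (an assumption for this sketch): group emitted values by key,
// then reduce each group. Like MongoDB, it skips reduce for single-value keys.
function runMapReduce(docs, map, reduce) {
  const grouped = new Map();
  globalThis.emit = (key, value) => {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc));
  delete globalThis.emit;
  return [...grouped].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

const rollup = runMapReduce(orders, mapFn, reduceFn);
for (const row of rollup) {
  // The average is derived at the end from the reduced sum and count.
  console.log(row._id, row.value.sum, row.value.sum / row.value.count);
}
```

Carrying both sum and count through reduce, and only dividing at the end, keeps the average correct even if reduce is applied to partial results.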

The output from a map-reduce can come in many forms (typically flat files). In MongoDB, the output is actually a new collection.

C# Specifics

In MongoDB, all map-reduce jobs run inside the JavaScript engine, so both the map and reduce functions are written in JavaScript. The various drivers will let you build the JavaScript and issue the command; however, this is not how I normally do it.

The preferred method for running Map Reduce jobs is to put the JS into a file and then run mongo map_reduce.js. Generally you'll do this on the server somewhere as a cron job or a scheduled task.
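For a sense of what such a file might contain, here is a minimal sketch. The collection name orders, the fields customer and total, and the output collection order_totals are all hypothetical; db.collection.mapReduce() with an out option is the shell helper I am assuming is available in your MongoDB version. The db check is only there so the file does nothing harmful if it is run outside the mongo shell.

```javascript
// map_reduce.js: intended to be run as `mongo map_reduce.js`.
// All collection and field names below are placeholders.

// Emit one (customer, total) pair per order document.
var mapOrders = function () {
  emit(this.customer, this.total);
};

// Sum the totals for one customer; input and output are both numbers,
// so the function is safe to re-apply to partial results.
var reduceTotals = function (key, values) {
  return values.reduce(function (sum, v) {
    return sum + v;
  }, 0);
};

// `db` is provided by the mongo shell; skip the job anywhere else.
if (typeof db !== "undefined") {
  // Results are written to the "order_totals" collection.
  db.orders.mapReduce(mapOrders, reduceTotals, { out: "order_totals" });

  // Application code would query this collection instead of running the job.
  printjson(db.order_totals.find().toArray());
}
```

A scheduler would then run the script periodically, and your C# code only ever reads from the output collection.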

Why?

Well, map reduce is not "real-time", especially with a big data set. It's really designed to be used in a batch fashion. Don't get me wrong, you can call it from your code, but generally you don't want users to initiate map reduce jobs. Instead, you want those jobs to be scheduled and you want users to be querying the results :)


Map Reduce is a way to process data where you have a map stage/function that identifies all the data to be processed and processes it, row by row.

Then you have a reduce step/function that can be run multiple times, for example once per server in a cluster and then once in the client to return a final result.

Here is a Wiki article describing it in more detail:

http://en.wikipedia.org/wiki/MapReduce

And here is the MongoDB documentation for MapReduce:

http://www.mongodb.org/display/DOCS/MapReduce

Simple example, find the longest string in a list.

The map step loops over the list, calculating the length of each string; the reduce step loops over the results from map and keeps only the longest one.

This can of course be much more complex, but that's the essence of it.
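A minimal sketch of the longest-string example, run locally rather than against a database. The word list is made up, and splitting it into two slices is just a way to imitate "reduce once per server, then once more for the final result".

```javascript
// Hypothetical input list.
const words = ["map", "reduce", "mongodb", "linq", "aggregate"];

// Map: one output per input string, carrying its length.
function mapWord(word) {
  return { text: word, length: word.length };
}

// Reduce: keep the longest entry. Output has the same shape as the
// inputs, so results of one reduce can be fed into another.
function reduceLongest(values) {
  return values.reduce((best, v) => (v.length > best.length ? v : best));
}

const mapped = words.map(mapWord);

// Imitate two servers each reducing their own share of the data...
const partialA = reduceLongest(mapped.slice(0, 2));
const partialB = reduceLongest(mapped.slice(2));

// ...followed by one final reduce over the partial results.
const longest = reduceLongest([partialA, partialB]);

console.log(longest.text); // aggregate
```

Reducing everything in one pass with reduceLongest(mapped) gives the same answer, which is exactly the property that lets the reduce step be run multiple times.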