
Calculating ranking in Map/Reduce


In CouchDB, map/reduce builds one-dimensional indexes so that CouchDB can quickly find information by key.

First, map/reduce builds the copies_purchased view easily enough, as you say. But the key space is the ISBN; the values are what you care about, and they come back in no particular order.

For small applications, people simply fetch the entire data set and sort it in memory. That is a great shortcut if you know your requirements, but it does not scale.

A scalable solution is to place these rows in their own database. A second map/reduce can emit copies_purchased as the key and the ISBN as the value. (There is no need for a reduce step.)

Key                 Value
copies_purchased    ISBN
7                   BBBB
6                   AAAA
3                   CCCC

You can fetch the top N rows, or you can find, e.g., the seventh-ranked book by querying with ?skip=6&limit=1.
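For example, a minimal Java sketch of that query against CouchDB's HTTP API (the rankings database and rank/by_copies view names are assumptions, and descending=true is passed because views sort ascending by default):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SeventhRankedBook {
    public static void main(String[] args) throws Exception {
        // Database and view names here are illustrative, not prescribed by CouchDB.
        URL view = new URL("http://localhost:5984/rankings/_design/rank/_view/by_copies"
                + "?descending=true&skip=6&limit=1");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(view.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the JSON row for the seventh-ranked book
            }
        }
    }
}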


If the rank is determined by the number of copies sold, then you can build that table using a SQL select cursor:

select * from ORDERS order by copies_purchased desc

And then assign a rank based on the order in which you retrieve records:

int i = 1;
while (nextRecord) currRecord.rank = i++;
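Fleshed out as a minimal JDBC sketch (the connection URL is a placeholder; the ORDERS table and isbn/copies_purchased columns are taken from the query above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

public class RankByCopiesSold {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/shop");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select isbn, copies_purchased from ORDERS order by copies_purchased desc")) {
            // Rank is simply the position in the sorted result set.
            Map<String, Integer> rankByIsbn = new LinkedHashMap<>();
            int i = 1;
            while (rs.next()) {
                rankByIsbn.put(rs.getString("isbn"), i++);
            }
            System.out.println(rankByIsbn); // e.g. {BBBB=1, AAAA=2, CCCC=3}
        }
    }
}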


I'm not sure how you would do this using CouchDB. As far as I know, there is no way to read CouchDB data directly into Hadoop. The closest thing I'm aware of is Brisk, which combines Hadoop and Cassandra. It's also free.

Alternatively, if the results did not have to be up to the minute, you could dump the relevant data to text or sequence files and use those as your input.

I think you would have to do this as a two-step process. First, generate the copies purchased per product, which is basically the word count example that is so common with Hadoop.
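A minimal sketch in that word-count style, assuming one order per input line in the hypothetical form isbn<TAB>quantity:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CopiesPurchasedJob {

    // Mapper: emit one (isbn, quantity) pair per order line.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
        }
    }

    // Reducer: sum the quantities per ISBN, exactly as word count sums its 1s.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text isbn, Iterable<IntWritable> quantities, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable q : quantities) {
                total += q.get();
            }
            context.write(isbn, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "copies purchased");
        job.setJarByClass(CopiesPurchasedJob.class);
        job.setMapperClass(OrderMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so it doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}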

Since you can relatively easily find the maximum number of copies purchased by looking at the output of the copies-purchased job (this may be a job in itself), you can then create a custom partitioner that divides the products according to copies purchased. So if you have 3 reducers and the maximum sold is 600 copies, then reducer 1 takes products selling 0-200 copies, reducer 2 takes products selling 201-400, and reducer 3 takes products selling 401-600 copies. You can then merge the sorted reducer output files, and you have your sorted list of copies sold.
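A minimal partitioner sketch for that sort job, assuming its map output key is the number of copies sold (IntWritable) and the value is the ISBN; the hard-coded MAX_COPIES stands in for the figure read from the earlier job's output:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CopiesSoldPartitioner extends Partitioner<IntWritable, Text> {

    // Assumed to come from the output of the max-finding job; hard-coded for illustration.
    private static final int MAX_COPIES = 600;

    @Override
    public int getPartition(IntWritable copiesSold, Text isbn, int numPartitions) {
        // With 3 partitions and MAX_COPIES = 600 this sends 0-200 to reducer 0,
        // 201-400 to reducer 1, and 401-600 to reducer 2.
        long bucket = (long) copiesSold.get() * numPartitions / (MAX_COPIES + 1);
        return (int) Math.min(bucket, numPartitions - 1);
    }
}

One caveat: Hadoop sorts keys in ascending order within each reducer, so the merged result will be ascending; you could reverse it afterwards, or supply a descending comparator via Job.setSortComparatorClass if you want the top sellers first.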

Or, for source code, check out the TeraSort benchmark code here. There is more info about the TeraSort classes here.

So you end up with a workflow like:

  1. A job to calculate the number of copies sold per product
  2. A job that finds the highest number of copies sold, based on the output of the previous job (though you might be able to skip this step, depending on how you implement the sort)
  3. A job that sorts the data and gives you a sorted list of product copies sold. The output may be spread over multiple files, so you might need a simple script that merges them together (see the note below).
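On that last point: if each reducer covers a contiguous range of copies sold (as with the partitioner sketched above), the part files are already globally ordered, so Hadoop's built-in hadoop fs -getmerge <hdfs-output-dir> <local-file> can concatenate them in order and you may not need a custom script at all.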

For help managing a multi step workflow like this, have a look at Oozie or Cascading.

For more on sorting, see this answer.