
Calculating ranking in Map/Reduce


In CouchDB, map/reduce builds one-dimensional indexes so that CouchDB can quickly find information by key.

First, map/reduce builds the copies_purchased view easily enough, as you say. But the key space is the ISBN; the values are what you care about, and they come back in no particular order.

For small applications, people simply fetch the entire data set and sort it in memory. That is a great shortcut if you know your requirements, but it does not scale.

A scalable solution is to place these rows in their own database. A second map/reduce can emit copies_purchased as the key and the ISBN as the value. (There is no need for a reduce step.)

Key                 Value
copies_purchased    ISBN
7                   BBBB
6                   AAAA
3                   CCCC

You can fetch the top N rows, or you can find, e.g., the seventh-ranked book by querying with ?skip=6&limit=1.
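For example, a minimal Java sketch of that query against CouchDB's HTTP API (the rankings database and rank/by_copies view names are assumptions, and descending=true is passed because views sort ascending by default):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SeventhRankedBook {
    public static void main(String[] args) throws Exception {
        // Database and view names here are illustrative, not prescribed by CouchDB.
        URL view = new URL("http://localhost:5984/rankings/_design/rank/_view/by_copies"
                + "?descending=true&skip=6&limit=1");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(view.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the JSON row for the seventh-ranked book
            }
        }
    }
}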


If the rank is determined by the number of copies sold, then you can build that table using a SQL select cursor:

select * from ORDERS order by copies_purchased desc

And then assign a rank based on the order in which you retrieve records:

int i = 1;
while (nextRecord) currRecord.rank = i++;
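Fleshed out as a minimal JDBC sketch (the connection URL is a placeholder; the ORDERS table and isbn/copies_purchased columns are taken from the query above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

public class RankByCopiesSold {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/shop");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select isbn, copies_purchased from ORDERS order by copies_purchased desc")) {
            // Rank is simply the position in the sorted result set.
            Map<String, Integer> rankByIsbn = new LinkedHashMap<>();
            int i = 1;
            while (rs.next()) {
                rankByIsbn.put(rs.getString("isbn"), i++);
            }
            System.out.println(rankByIsbn); // e.g. {BBBB=1, AAAA=2, CCCC=3}
        }
    }
}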


I'm not sure how you would do this using CouchDB. As far as I know, there is no way to read CouchDB data directly into Hadoop. The closest thing I'm aware of is Brisk, which combines Hadoop and Cassandra. It's also free.

Alternatively, if the results did not have to be up to the minute, you could dump the relevant data to text or sequence files and use those as your input.

I think you would have to do this as a two-step process. First, generate the copies purchased per product, which is basically the word count example that is so common with Hadoop.
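A minimal sketch in that word-count style, assuming one order per input line in the hypothetical form isbn<TAB>quantity:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CopiesPurchasedJob {

    // Mapper: emit one (isbn, quantity) pair per order line.
    public static class OrderMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            context.write(new Text(fields[0]), new IntWritable(Integer.parseInt(fields[1])));
        }
    }

    // Reducer: sum the quantities per ISBN, exactly as word count sums its 1s.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text isbn, Iterable<IntWritable> quantities, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable q : quantities) {
                total += q.get();
            }
            context.write(isbn, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "copies purchased");
        job.setJarByClass(CopiesPurchasedJob.class);
        job.setMapperClass(OrderMapper.class);
        job.setCombinerClass(SumReducer.class); // summing is associative, so it doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}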

Since you can relatively easily find the maximum number of copies purchased by looking at the output of the copies-purchased job (this may be a job in itself), you can then create a custom partitioner that divides the products according to copies purchased. So if you have 3 reducers and the maximum sold is 600 copies, then reducer 1 takes products selling 0-200 copies, reducer 2 takes products selling 201-400, and reducer 3 takes products selling 401-600 copies. You can then merge the sorted reducer output files, and you have your sorted list of copies sold.
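A minimal partitioner sketch for that sort job, assuming its map output key is the number of copies sold (IntWritable) and the value is the ISBN; the hard-coded MAX_COPIES stands in for the figure read from the earlier job's output:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CopiesSoldPartitioner extends Partitioner<IntWritable, Text> {

    // Assumed to come from the output of the max-finding job; hard-coded for illustration.
    private static final int MAX_COPIES = 600;

    @Override
    public int getPartition(IntWritable copiesSold, Text isbn, int numPartitions) {
        // With 3 partitions and MAX_COPIES = 600 this sends 0-200 to reducer 0,
        // 201-400 to reducer 1, and 401-600 to reducer 2.
        long bucket = (long) copiesSold.get() * numPartitions / (MAX_COPIES + 1);
        return (int) Math.min(bucket, numPartitions - 1);
    }
}

One caveat: Hadoop sorts keys in ascending order within each reducer, so the merged result will be ascending; you could reverse it afterwards, or supply a descending comparator via Job.setSortComparatorClass if you want the top sellers first.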

Or, for source code, check out the TeraSort benchmark code here. There is more info about the TeraSort classes here.

So you end up with a workflow like:

  1. A job to calculate the number of copies sold per product
  2. A job that finds the highest number of copies sold, based on the output of the previous job (though you might be able to skip this step, depending on how you implement the sort)
  3. A job that sorts the data and gives you a sorted list of product copies sold. The output may be spread over multiple files, so you might need a simple script that merges them together (see the note below).
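On that last point: if each reducer covers a contiguous range of copies sold (as with the partitioner sketched above), the part files are already globally ordered, so Hadoop's built-in hadoop fs -getmerge <hdfs-output-dir> <local-file> can concatenate them in order and you may not need a custom script at all.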

For help managing a multi step workflow like this, have a look at Oozie or Cascading.

For more on sorting, see this answer.