Parallelizing Ruby reducers in Hadoop?


Reducers will always run in parallel, whether you're using streaming or not (if you're not seeing this, verify that the job is configured to allow more than one reduce task -- see mapred.reduce.tasks in your cluster or job configuration). The difference is that the framework packages things up a little more nicely for you with Java than with streaming.

For Java, the reduce task gets an iterator over all the values for a particular key. This makes it easy to walk the values if you are, say, summing the map output in your reduce task. In streaming, you literally just get a stream of key-value pairs. You are guaranteed that the values will be ordered by key, and that the values for a given key will not be split across reduce tasks, but any state tracking you need is up to you. For example, in Java your map output comes to your reducer symbolically in the form

key1, {val1, val2, val3}
key2, {val7, val8}

With streaming, your output instead looks like

key1, val1
key1, val2
key1, val3
key2, val7
key2, val8

For example, to write a reducer that computes the sum of the values for each key, you'll need a variable to store the last key you saw and a variable to store the running sum. Each time you read a new key-value pair, you do the following (a Ruby sketch appears after the list):

  1. check if the key is different from the last key.
  2. if so, output the last key and the current sum, and reset the sum to zero.
  3. add the current value to your sum and set the last key to the current key.
  4. once the input is exhausted, output the final key and sum.
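
Here's a minimal Ruby sketch of that loop, assuming the default streaming format of tab-separated key/value lines on stdin and integer values (the final puts handles step 4):

    #!/usr/bin/env ruby
    # Streaming sum reducer: stdin is "key<TAB>value" lines, sorted by key.
    last_key = nil
    sum = 0

    STDIN.each_line do |line|
      key, value = line.chomp.split("\t", 2)
      if last_key && key != last_key
        # The key changed, so the previous group is complete: emit it.
        puts "#{last_key}\t#{sum}"
        sum = 0
      end
      last_key = key
      sum += value.to_i
    end

    # Flush the last group once input is exhausted.
    puts "#{last_key}\t#{sum}" if last_key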

HTH.


I haven't tried Hadoop Streaming myself, but from reading the docs I think you can achieve similar parallel behavior.

Instead of passing a key with the associated values to each reducer, streaming will group the mapper output by keys. It also guarantees that values with the same keys won't be split over multiple reducers. This is somewhat different from normal Hadoop functionality, but even so, the reduce work will be distributed over multiple reducers.

Try the -verbose option to get more information about what's really going on. You can also experiment with the -D mapred.reduce.tasks=X option, where X is the desired number of reducers.
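
For example, a full streaming invocation with an explicit reducer count might look something like this (the streaming jar path varies by Hadoop version, and mapper.rb/reducer.rb are hypothetical script names):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=4 \
      -input /user/me/input \
      -output /user/me/output \
      -mapper mapper.rb \
      -reducer reducer.rb \
      -file mapper.rb \
      -file reducer.rb \
      -verbose

Note that -D is a generic option and has to come before the streaming-specific options.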