
Hadoop Matrix Multiplication


In TestMatrixMultiply.java from the site you linked, which presumably contains the code you're using to encode your matrices into the expected IndexPair-keyed SequenceFile format, there is the function writeMatrix:

    public static void writeMatrix (int[][] matrix, int rowDim, int colDim, String pathStr)
        throws IOException
    {
        Path path = new Path(pathStr);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
            MatrixMultiply.IndexPair.class, IntWritable.class,
            SequenceFile.CompressionType.NONE);
        MatrixMultiply.IndexPair indexPair = new MatrixMultiply.IndexPair();
        IntWritable el = new IntWritable();
        for (int i = 0; i < rowDim; i++) {
            for (int j = 0; j < colDim; j++) {
                int v = matrix[i][j];
                if (v != 0) { // !!! well, that would be why we aren't writing 0s
                    indexPair.index1 = i;
                    indexPair.index2 = j;
                    el.set(v);
                    writer.append(indexPair, el);
                }
            }
        }
        writer.close();
    }

Note the comment I inserted on the second line of the inner for loop.

Your mapper is reading no 0s because your input files contain no 0s.

The code is designed around the assumption that any missing value is 0 (a sparse representation), and it performs additional checks to avoid emitting 0s in order to minimize network traffic.
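If you genuinely wanted explicit 0 entries in your input files, that guard is the line to change. A minimal sketch of the inner loops of writeMatrix with the guard removed (reusing the indexPair, el, and writer set up above), though doing this defeats the sparse design just described:

    // Sketch: write every element, including 0s, by dropping the v != 0 guard.
    // This inflates the input files and the network traffic the sparse
    // representation was meant to avoid.
    for (int i = 0; i < rowDim; i++) {
        for (int j = 0; j < colDim; j++) {
            indexPair.index1 = i;
            indexPair.index2 = j;
            el.set(matrix[i][j]);
            writer.append(indexPair, el); // append unconditionally, 0 or not
        }
    }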

Everything below here is mostly wrong because I misunderstood the algorithm
(the part above is still useful, though)

From the linked page, you're using Strategy 3, which relies on partitioner behavior and sort order. Unfortunately, the partitioner is wrong, and that is a separate problem from the 0s not being written. You're getting matrices full of 0s because each reducer is multiplying against block data that was initialized to 0 and never overwritten with the correct values for that block. The bug is hidden in single-machine operation, where every key goes to the lone reducer and the partitioner is effectively a no-op, but it breaks on most clusters.

The partitioner maps an intermediate key (kb, jb, ib) to a reducer r as follows:

r = (jb*KB + kb) mod R
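In Hadoop terms, that formula corresponds to a Partitioner along these lines. This is a minimal sketch, not the project's actual class: BlockKey stands in for the job's real composite key type (which would also have to implement WritableComparable), and the KB constant is an assumed block count.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical composite key (kb, jb, ib); the real job defines its own
    // WritableComparable key type carrying the same three components.
    class BlockKey {
        int kb, jb, ib;
    }

    // The partitioner as described: r = (jb*KB + kb) mod R, where R is the
    // number of reducers.
    public class Strategy3Partitioner extends Partitioner<BlockKey, IntWritable> {
        private static final int KB = 4; // assumed number of blocks along k

        @Override
        public int getPartition(BlockKey key, IntWritable value, int numReducers) {
            return (key.jb * KB + key.kb) % numReducers;
        }
    }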

It needs to guarantee that all keys for the same block are assigned to the same reducer. Unfortunately, it guarantees that this will not happen unless KB % numReducers == 0:

    map (key, value)
        if from matrix A with key = (i, k) and value = a(i, k)
            for 0 <= jb < NJB
                emit (k/KB, jb, i/IB), (i mod IB, k mod KB, a(i, k))  // compare this...
        if from matrix B with key = (k, j) and value = b(k, j)
            emit (k/KB, j/JB, -1), (k mod KB, j mod JB, b(k, j))      // ...to this

For matrix A, the key component jb is iterated over; for matrix B, it is calculated from j. Since assignment to a reducer is round-robin over these key values, this guarantees that the A-matrix keys will not be assigned to the same reducer as the corresponding B-matrix keys. The algorithm therefore fails, since it assumes something about the assignment and ordering of keys that is provably incorrect. Its assumption about key order holds if and only if there is a single reducer to which all keys are assigned, and the partitioner does not guarantee that.

The partitioner must be modified to use kb % numReducers for Strategy 3. This is not a very good partitioner, but it is the only one that will work, because non-identical keys must be sorted to the same reducer in a particular order.
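Concretely, the fix looks something like this (same hypothetical BlockKey as in the sketch above), so that every key with a given kb, regardless of jb and ib, lands on the same reducer:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Fixed Strategy 3 partitioner: partition on kb alone. All keys sharing
    // a kb go to one reducer, where the framework's key sort can then deliver
    // them in the order the algorithm depends on.
    public class Strategy3FixedPartitioner extends Partitioner<BlockKey, IntWritable> {
        @Override
        public int getPartition(BlockKey key, IntWritable value, int numReducers) {
            return key.kb % numReducers;
        }
    }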

The code you actually posted in the question is unrelated to where the bug exists.