
how to implement eigenvalue calculation with MapReduce/Hadoop?


PageRank solves the dominant eigenvector problem by iteratively finding the steady-state discrete flow condition of the network.

If the N×N matrix A describes the link weight (amount of flow) from node n to node m, then

p_{k+1} = A p_k

In the limit where p has converged to a steady state (p_{k+1} = p_k), this is an eigenvector problem with eigenvalue 1.
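
To make the iteration concrete, here is a minimal single-machine power-iteration sketch in plain Java (the 3x3 matrix, starting vector, and tolerance are illustrative assumptions, not part of PageRank proper):

public class PowerIteration {
    public static void main(String[] args) {
        // Illustrative column-stochastic matrix: a[i][j] is the flow from node j to node i.
        // Columns sum to 1, so A p preserves the total and no renormalization is needed.
        double[][] a = {
            {0.0, 0.5, 0.5},
            {0.5, 0.0, 0.5},
            {0.5, 0.5, 0.0}
        };
        double[] p = {1.0, 0.0, 0.0}; // arbitrary starting distribution

        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[p.length];
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < a[i].length; j++)
                    next[i] += a[i][j] * p[j]; // p_{k+1} = A p_k

            // Measure how far we are from the steady state p_{k+1} = p_k.
            double delta = 0.0;
            for (int i = 0; i < p.length; i++)
                delta = Math.max(delta, Math.abs(next[i] - p[i]));
            p = next;
            if (delta < 1e-9) break; // converged to the dominant eigenvector
        }

        for (double v : p) System.out.println(v);
    }
}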

The PageRank algorithm doesn't require the matrix to be held in memory, but it is inefficient on dense (non-sparse) matrices. For dense matrices, MapReduce is the wrong solution -- dense eigenproblems need data locality and heavy exchange among nodes -- and you should instead look at LAPACK and MPI and friends.
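
As a rough illustration of why the matrix needn't be held in memory, here is a hedged sketch of one p_{k+1} = A p_k step as a Hadoop job. It assumes matrix entries arrive one per line as "row col weight", and that the current vector fits in each mapper's memory and is shipped alongside the job (e.g. with -files as p.txt, lines of "index value"); the file name and input layout are assumptions for this example, not part of any library mentioned here.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatVecStep {

    // Input line: "row col weight". Emits (row, weight * p[col]).
    public static class MultiplyMapper
            extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
        private final Map<Integer, Double> p = new HashMap<>();

        @Override
        protected void setup(Context ctx) throws IOException {
            // "p.txt" is assumed to have been shipped with the job (e.g. -files p.txt).
            try (BufferedReader in = new BufferedReader(new FileReader("p.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    p.put(Integer.parseInt(f[0]), Double.parseDouble(f[1]));
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().trim().split("\\s+");
            int row = Integer.parseInt(f[0]);
            int col = Integer.parseInt(f[1]);
            double w = Double.parseDouble(f[2]);
            ctx.write(new IntWritable(row),
                      new DoubleWritable(w * p.getOrDefault(col, 0.0)));
        }
    }

    // Sums the partial products for each row: next_p[row] = sum over col of w * p[col].
    public static class SumReducer
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable row, Iterable<DoubleWritable> parts, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable d : parts) sum += d.get();
            ctx.write(row, new DoubleWritable(sum));
        }
    }
    // Job configuration and the driver loop (re-running until p stops changing) are omitted.
}

Each iteration is one full MapReduce pass over the matrix entries, which is exactly why this approach only pays off when A is sparse.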

You can see a working PageRank implementation in the wukong library (Hadoop streaming for Ruby) or in the Heritrix PageRank submodule. (The PageRank code runs independently of Heritrix.)

(disclaimer: I am an author of wukong.)


PREAMBLE:

Given the right partitioning of data, one can compute results in parallel without keeping a complete dataset on every machine.

Take for example the following loop:

for (int i = 0; i < m.length; i++) {
    for (int j = 0; j < m[i].length; j++) {
        m[i][j]++;
    }
}

And given a matrix of the following layout:

      j=0   j=1   j=2
i=0  [   ] [   ] [   ]
i=1  [   ] [   ] [   ]
i=2  [   ] [   ] [   ]

Parallel constructs exist such that each column j can be sent to a different machine and the columns computed in parallel, as the sketch below illustrates. The difficult part of parallelization comes when you've got loops that contain dependencies.
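
As a single-machine illustration of that independence (a sketch; the array m and its size are made up for the example):

import java.util.stream.IntStream;

public class ParallelColumns {
    public static void main(String[] args) {
        int[][] m = new int[3][3];

        // Each column j touches a disjoint set of elements, so the column
        // loop can run in parallel with no synchronization.
        IntStream.range(0, m[0].length).parallel().forEach(j -> {
            for (int i = 0; i < m.length; i++) {
                m[i][j]++;
            }
        });
    }
}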

for (int i = 0; i < m.length; i++) {
    for (int j = 0; j < m[i].length; j++) {
        // For obvious reasons, matrix index verification code removed
        m[i][j] = m[i / 2][j] + m[i][j + 7];
    }
}

Obviously a loop like the one above becomes extremely problematic (notice the indices m[i/2][j] and m[i][j+7]). But techniques do exist for restructuring these types of loops into effective parallel algorithms; one is sketched below.
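
For this particular loop, note that m[i/2][j] reads a row strictly before row i (for i >= 1), while m[i][j+7] must see row i's values from before the pass. So the rows can be processed in order while the work within each row runs in parallel against a snapshot of that row. A sketch under those assumptions:

import java.util.Arrays;
import java.util.stream.IntStream;

public class DependentLoop {
    static void update(int[][] m) {
        // Rows are visited in order because row i depends on row i/2,
        // which (for i >= 1) was fully updated by an earlier iteration.
        for (int i = 0; i < m.length; i++) {
            // Snapshot row i: m[i][j+7] must read pre-pass values, and for
            // i == 0 the m[i/2][j] term is also a same-row (pre-pass) read.
            final int[] oldRow = Arrays.copyOf(m[i], m[i].length);
            final int[] srcRow = (i == 0) ? oldRow : m[i / 2];
            final int[] row = m[i];
            IntStream.range(0, row.length).parallel().forEach(j -> {
                if (j + 7 < oldRow.length) { // the index check the original elided
                    row[j] = srcRow[j] + oldRow[j + 7];
                }
            });
        }
    }
}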

ANSWER:

It is possible that Google developed a solution to compute an eigenvalue without maintaining a copy of the matrix on all slave machines. Or they used something like a Monte Carlo method or another approximation algorithm to arrive at a "close enough" calculation.

In fact, I'd go so far as to say that Google will have gone to great lengths to make every calculation required for their PageRank algorithm as efficient as possible. When you're running racks of commodity machines (the early Google servers, famously wired together with ordinary Ethernet), you can't be shuttling large datasets (100s of gigabytes) between nodes; the commodity NICs simply don't have the bandwidth.

With that said, Google is good at surprising the programmer community and their implementation could be entirely different.

POSTAMBLE:

Some good resources for parallel computing include OpenMP and MPI. The two approach parallel computing from very different paradigms, stemming from the machine models they target (shared-memory multiprocessors vs. distributed-memory clusters).


I suspect it is intractable for most matrices except those with special structures (e.g. sparse matrices or ones with certain block patterns). There's way too much coupling between matrix coefficients and eigenvalues.

PageRank uses a very sparse matrix of a special form, and any conclusions from calculating its eigenvalues almost certainly don't extend to general matrices. (edit: here's another reference that looks interesting)