
What types/classes of algorithms can be recast in the MapReduce paradigm?


I am working through these same questions for a collection of big data algorithms that come from the MPI world. Here is my take.

The basic pipeline for MR formulations appears to be expansion/contraction. The map is applied to a large set, possibly producing an even larger one, and the reduce then sorts/organizes that set so it can be aggregated into a consolidated, and preferably much smaller, data set. Figuring out how many map and reduce rounds you need is where the cleverness of an MR algorithm lies. See the sketch below for the basic shape.
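To make that shape concrete, here is a minimal word-count sketch in plain Python (no Hadoop involved; the function names are my own): the map expands input lines into many (word, 1) pairs, and the reduce contracts each group back into a single count.

```python
from collections import defaultdict

# Map: expand each input line into many (word, 1) pairs -- the set grows.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group all values by key (what the framework does between phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: contract each group into a single consolidated record.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```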

As a general computational approach you can solve any computable problem with MR, but from a practical point of view, MR's resource utilization is skewed in favor of computational problems with high concurrent I/O requirements. Embarrassingly parallel algorithms like word counting certainly fit that bill, but the class is broader than that: your k-means algorithm, for example, is a constrained minimization problem that nobody would categorize as embarrassingly parallel, yet it still has an efficient MR formulation (a sketch follows).
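Here is my own illustrative sketch of one k-means iteration as a single map/reduce round, again in plain Python rather than a real Hadoop job: the map assigns each point to its nearest centroid independently (that part *is* embarrassingly parallel), and the reduce averages each group into a new centroid.

```python
import math
from collections import defaultdict

def closest(point, centroids):
    # Assign a point to its nearest centroid (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    # Map: emit (centroid_index, point) for every point; each assignment
    # is independent, so this parallelizes trivially across mappers.
    groups = defaultdict(list)
    for p in points:
        groups[closest(p, centroids)].append(p)
    # Reduce: contract each group to its mean, producing the new centroids.
    new_centroids = list(centroids)
    for i, members in groups.items():
        new_centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0), (10, 10)]
for _ in range(5):  # in a real job, iterate until convergence
    centroids = kmeans_iteration(points, centroids)
print(centroids)  # roughly [(0.0, 0.5), (10.0, 10.5)]
```

Note the dependency structure: each iteration is one MR round, and the new centroids must be broadcast to all mappers before the next round, which is why k-means is not embarrassingly parallel even though each round is.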

My current formal framework characterizes a distributed computer system in terms of five attributes:

  1. processor performance
  2. memory capacity (we can ignore memory performance as it tends to be architected in by the processor designers to support the processor's performance)
  3. disk storage capacity
  4. network bandwidth performance
  5. network messaging latency

Disk performance is something I am still struggling to incorporate cleanly, as rotational vs. SSD storage technologies have huge performance implications, but only if the SSDs are attached via PCIe. If they are attached via SAS or SATA, you hit the interface limit, and rotational storage can easily saturate that interface as well. In that case, only the superb latency of SSDs helps, and that benefits only smaller data sets with smaller records. So for the moment, let's assume we have a true big data problem and need rotational storage to hold the data cost-effectively.

MapReduce uses the above distributed resources in that expansion/contraction flow: it uses processor+memory+disk to execute the map functions, then leans heavily on the performance of the network for the reduce. Adding servers scales the processor+memory+disk resources linearly, but unfortunately network bandwidth grows only modestly and network latency actually gets worse. Since network latency is a very difficult characteristic to minimize in a distributed system, MR algorithms are most effective for bandwidth-centric operators: algorithms that move billions of little, independent packets. The commutative and associative attributes Nishant highlights are a perfect summary for identifying that class of algorithms, since ordering requirements among these packets are greatly relaxed and simple queuing operators suffice; the sketch below shows why those properties matter for network traffic.
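As an illustration, here is a plain-Python sketch (my own names, simulating two map nodes rather than a real cluster) of map-side pre-aggregation, the role a Hadoop combiner plays: because the operator is commutative and associative, each node can collapse its own pairs before anything crosses the wire, so only small partials are shuffled.

```python
from collections import Counter

# Simulate two map nodes, each producing (word, 1) pairs locally.
node_a = [("fox", 1), ("the", 1), ("fox", 1)]
node_b = [("the", 1), ("dog", 1), ("the", 1)]

def combine(pairs):
    # Combiner: pre-aggregate on the map node before anything hits the
    # network. Valid only because + is commutative and associative.
    c = Counter()
    for word, n in pairs:
        c[word] += n
    return list(c.items())

# Only the combined partials cross the network: 2 + 2 records instead
# of 3 + 3 here, and the gap widens enormously at real data volumes.
shuffled = combine(node_a) + combine(node_b)
totals = combine(shuffled)  # the reducer applies the same operator
print(sorted(totals))  # [('dog', 1), ('fox', 2), ('the', 3)]
```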

I am looking for insight into whether efficient MR algorithms exist for PDE solvers and for optimization algorithms such as integer programming. I found a great graphic, "Domains of Algorithmic Organization", from the folks doing FutureGrid.


The MapReduce paradigm is best suited to problems that are "embarrassingly parallel", i.e., where there is no dependency between any two tasks. See the Embarrassingly Parallel article on Wikipedia.

Also, in cases where the operations are commutative and associative, MapReduce programs can easily be optimized for better performance, e.g., by pre-aggregating results on the map side; the sketch below demonstrates why those properties make that safe.
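A tiny Python demonstration of the point (an illustration, not Hadoop code): a commutative and associative operator such as addition yields the same result under any grouping or ordering, which is exactly what lets the framework aggregate partial results wherever and whenever it is convenient.

```python
from functools import reduce
import operator

values = [3, 1, 4, 1, 5, 9, 2, 6]

# Because + is commutative and associative, the framework is free to
# merge partial results in any grouping and any order; the answer is
# identical either way, so map-side pre-aggregation is always safe.
left_to_right = reduce(operator.add, values)
arbitrary_split = sum(values[:3]) + sum(values[5:]) + sum(values[3:5])
print(left_to_right == arbitrary_split == 31)  # True
```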