
Data Mining Library for MPI


There is no reason why MPI (which is a specification, not a piece of software itself!) would necessarily be easier to install than Hadoop/Mahout. Admittedly, the latter two are currently a mess, in particular because of their Java library chaos. Apache Bigtop tries to make them easier to install, and once you've figured out some basics it's quite OK.

However:

  • If your data is small (i.e. it can be processed on a single node), don't install a cluster solution; you only pay for the overhead. Hadoop does not make much sense on a single host. Use Weka, ELKI, RapidMiner, KNIME or whatever.
  • If your data is large, you will want to minimize data transfer, and this is where the strength of Hadoop/Mahout lies: the computation is moved to the nodes that already hold the data. A typical message passing API does not scale the same way for data-heavy operations.
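To make the data-transfer point concrete, here is a minimal sketch in plain Python (it is not Hadoop or Mahout code). The partitions, the function names and the "records shipped" cost measure are all illustrative assumptions. It compares shipping every raw record to one node with summarising each partition where it lives and shipping only the small summaries.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the records stored on one node.
partitions = [
    ["a", "b", "a", "c"],
    ["b", "b", "a"],
    ["c", "a", "a", "a"],
]

def count_centralised(parts):
    """Ship every raw record to one node, then count there."""
    shipped = [rec for part in parts for rec in part]  # every record crosses the network
    return Counter(shipped), len(shipped)

def count_local_first(parts):
    """Count on each node, then ship only the per-node summaries."""
    summaries = [Counter(part) for part in parts]      # runs where the data lives
    transferred = sum(len(s) for s in summaries)       # one (key, count) pair per distinct key
    total = Counter()
    for s in summaries:
        total.update(s)                                # merge the small summaries
    return total, transferred

central_counts, central_cost = count_centralised(partitions)
local_counts, local_cost = count_local_first(partitions)

# Same answer, fewer items moved between nodes.
print(central_counts == local_counts, central_cost, local_cost)  # True 11 7
```

With only a handful of records the saving is small, but the number of summary entries grows with the number of distinct keys per node, not with the number of records, which is why this pattern pays off on large data.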

There are some efforts, such as Apache Hama, that are quite similar to MPI IMHO. Hama is based on message passing, but the messages are bulk-processed via barrier synchronization (the bulk synchronous parallel model). It might also aggregate messages before sending to reduce traffic.
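The barrier-synchronized style can be sketched with Python threads standing in for cluster nodes. This is only an illustration of the bulk synchronous pattern, not Hama's API: the worker count, the ring topology and all names are assumptions made up for the example.

```python
import threading

NUM_WORKERS = 3
SUPERSTEPS = 2

barrier = threading.Barrier(NUM_WORKERS)
# inboxes[step][rank] holds the messages that `rank` reads in superstep `step`.
# Each inbox has exactly one writer (the left neighbour), so no lock is needed.
inboxes = [[[] for _ in range(NUM_WORKERS)] for _ in range(SUPERSTEPS + 1)]
results = [None] * NUM_WORKERS

def worker(rank, value):
    for step in range(SUPERSTEPS):
        # 1. Compute: consume the messages delivered at the previous barrier.
        value += sum(inboxes[step][rank])
        # 2. Communicate: queue a message for the right neighbour's next superstep.
        neighbour = (rank + 1) % NUM_WORKERS
        inboxes[step + 1][neighbour].append(value)
        # 3. Synchronize: nobody starts the next superstep until everyone has sent.
        barrier.wait()
    # Consume the messages sent during the last superstep.
    results[rank] = value + sum(inboxes[SUPERSTEPS][rank])

threads = [threading.Thread(target=worker, args=(r, r + 1)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [9, 7, 8]
```

Because messages only become visible after the barrier, the result does not depend on thread scheduling; that determinism is the main trade-off against the freer send/receive style of plain MPI.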


I strongly recommend GraphLab. GraphLab, a distributed graph-parallel API, currently has toolkits including

  • topic modeling
  • collaborative filtering
  • clustering
  • graphical models

http://docs.graphlab.org/toolkits.html

GraphLab is a graph-based, high-performance, distributed computation framework written in C++. While GraphLab was originally developed for machine learning tasks, it has found great success at a broad range of other data-mining tasks, outperforming other abstractions by orders of magnitude.

GraphLab Features:

  • A unified multicore and distributed API: write once, run efficiently in both shared and distributed memory systems
  • Tuned for performance: optimized C++ execution engine leverages extensive multi-threading and asynchronous IO
  • Scalable: GraphLab intelligently places data and computation using sophisticated new algorithms
  • HDFS integration: access your data directly from HDFS
  • Powerful machine learning toolkits: turn big data into actionable knowledge with ease


This idea doesn't make sense to me, and I think you have some misconceptions. MPI is meant for tightly coupled systems, and I'm 99% sure it won't send messages to an external location. You can, however, process or analyze the data with MPI much more quickly (depending on your hardware). My 2 cents: you are better off using an open source messaging library, such as one of the AMQP implementations or ZeroMQ (which I would say is your best bet), and then processing all the data you get in R or Python or, if your data set is very, very large, with MPI. Another option is to call serial libraries on different machines that are connected and running MPI, provided each of them also has its own internet connection. R is really easy to call with MPI, and so is Python.
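As a rough sketch of what "calling serial code under MPI" can look like from Python, the example below assumes the mpi4py package is installed on top of a working MPI implementation. The helper names (`local_stats`, `merge_stats`, `split`) and the toy data are hypothetical; only the `scatter`/`gather` calls belong to mpi4py. Each rank runs ordinary serial code on its own chunk, and rank 0 merges the small per-rank results.

```python
def local_stats(chunk):
    """Serial work done independently on each rank: (count, sum) of its chunk."""
    return len(chunk), sum(chunk)

def merge_stats(parts):
    """Combine per-rank (count, sum) pairs into a global mean."""
    n = sum(count for count, _ in parts)
    total = sum(s for _, s in parts)
    return total / n if n else float("nan")

def split(data, k):
    """Split data into k roughly equal chunks (round-robin)."""
    return [data[i::k] for i in range(k)]

def main():
    try:
        from mpi4py import MPI  # assumption: mpi4py and an MPI runtime are available
    except ImportError:
        print("mpi4py is not installed; skipping the MPI run")
        return
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    # Only rank 0 holds the full (toy) data set and hands out one chunk per rank.
    chunks = split(list(range(1, 101)), size) if rank == 0 else None
    chunk = comm.scatter(chunks, root=0)
    # Every rank runs the serial routine; only small tuples travel back.
    parts = comm.gather(local_stats(chunk), root=0)
    if rank == 0:
        print("mean:", merge_stats(parts))

if __name__ == "__main__":
    main()
```

Saved under a name of your choosing (say `stats_mpi.py`), this would typically be launched with something like `mpiexec -n 4 python stats_mpi.py`; started without `mpiexec` it simply runs as a single rank.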