Are there any distributed machine learning libraries for using Python with Hadoop? [closed]


I do not know of any library that can be used natively in Python for machine learning on Hadoop, but an easy solution is the JPype module, which lets you interact with Java from within your Python code.

You can for example start a JVM like this:

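    from jpype import startJVM

    jvm = None  # set to "started" once the JVM is running

    def start_jpype():
        # classpath (the Java class path, e.g. the Mahout jars) and jvmlib
        # (the path to the JVM shared library) must be defined beforehand;
        # jpype.getDefaultJVMPath() can supply the latter.
        global jvm
        if jvm is None:
            cpopt = "-Djava.class.path={cp}".format(cp=classpath)
            startJVM(jvmlib, "-ea", cpopt)
            jvm = "started"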

There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.
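Once the JVM is up, Java classes can be loaded and called from Python. The following is only a minimal sketch, assuming JPype and a JVM are installed. `jvm_args` and `array_list_demo` are hypothetical helper names, and `java.util.ArrayList` stands in for whatever Mahout class you would actually load, since that depends on the Mahout jars on your class path.

```python
import os


def jvm_args(jars):
    """Build the JVM options: assertions on, plus a class path if jars are given."""
    args = ["-ea"]
    if jars:
        # os.pathsep is ':' on Linux/macOS and ';' on Windows.
        args.append("-Djava.class.path=" + os.pathsep.join(jars))
    return args


def array_list_demo(jars=()):
    """Start the JVM if needed, then use a standard Java class from Python."""
    import jpype  # third-party; imported here so jvm_args() works without it

    if not jpype.isJVMStarted():
        jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args(jars))

    # Stand-in for a Mahout class; any class on the class path loads the same way.
    ArrayList = jpype.JClass("java.util.ArrayList")
    items = ArrayList()
    items.add("hadoop")
    items.add("mahout")
    return int(items.size())
```

Calling `array_list_demo()` exercises the round trip into Java; for Mahout you would pass the paths of the Mahout jars as `jars` and load the relevant class with `jpype.JClass` instead.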


Answers to the questions:

  1. To my knowledge, no. Python has an extensive collection of machine learning modules and of map-reduce modules, but none that combine the two (ML+MR).

  2. I would say yes. Since you are a heavy programmer, you should be able to catch up with Java fairly fast, provided you are not involved with those nasty (sorry, no offense) J2EE frameworks.
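To illustrate what "ML+MR" means in the first point: without a ready-made library, you would express the algorithm as map and reduce steps yourself. Below is a rough sketch of one k-means iteration written that way in plain Python. It runs locally and uses no Hadoop API; the function names are made up for illustration, and the dictionary only imitates the shuffle phase that Hadoop would perform between map and reduce.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_point(point, centroids):
    """Map step: emit (cluster index, point)."""
    return nearest(point, centroids), point


def reduce_cluster(points):
    """Reduce step: the mean of all points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce pass; returns the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # A cluster that received no points keeps its previous centroid.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]
```

Repeating `kmeans_iteration` until the centroids stop moving gives the full algorithm; on a cluster, each pass would be one MapReduce job.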


I would recommend using Java when you are using EMR.

First, and simply, it's the way it was designed to work. If you're going to play in Windows, you write in C#; if you're making a web service in Apache, you use PHP. When you're running Hadoop MapReduce in EMR, you use Java.

Second, all the tools are there for you in Java, like the AWS SDK. I regularly develop MapReduce jobs for EMR quickly with the help of NetBeans, Cygwin (when on Windows), and s3cmd (in Cygwin). I use NetBeans to build my MR jar, and Cygwin + s3cmd to copy it to my S3 directory to be run by EMR. I then also write a program using the AWS SDK to launch my EMR cluster with my config and to run my jar.
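The launcher program described above is written against the Java SDK. Purely to show the shape of that launch request, here is a sketch of the same idea in Python (the language of the question) using boto3 instead. Every concrete value is a placeholder assumption: the cluster name, instance types and count, and the helper names `build_jar_step` and `launch_cluster` are made up, and the role names are the EMR defaults, which may not exist in your account.

```python
def build_jar_step(name, jar_s3_uri, args):
    """Describe one EMR step that runs a MapReduce jar stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": jar_s3_uri, "Args": list(args)},
    }


def launch_cluster(step, release_label, log_uri):
    """Start a transient EMR cluster that runs the step and then shuts down."""
    import boto3  # third-party; needs AWS credentials configured

    emr = boto3.client("emr")
    response = emr.run_job_flow(
        Name="mr-job",                      # placeholder cluster name
        ReleaseLabel=release_label,         # an EMR release label of your choice
        LogUri=log_uri,                     # S3 location for the cluster logs
        Instances={
            "MasterInstanceType": "m5.xlarge",   # assumed instance type
            "SlaveInstanceType": "m5.xlarge",    # assumed instance type
            "InstanceCount": 3,                  # assumed cluster size
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[step],
        JobFlowRole="EMR_EC2_DefaultRole",  # default role names, may differ
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```

The jar itself is still the one built in NetBeans and copied to S3; only the launch call changes language here.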

Third, there are many Hadoop debugging tools for Java (though they usually need macOS or Linux to work).

Please see here for creating a new NetBeans project with Maven for Hadoop.