
Is it possible to run Python's scikit-learn algorithms over Hadoop? [closed]


Short answer: yes, because you can run almost anything on Hadoop.

Long answer: it depends. Start by answering this question:

  • Can you split your dataset into partitions?
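To make the partition question concrete, here is a minimal sketch of one strategy that only works when the data can be split: fit a scikit-learn linear model on each partition independently (the "map" step) and average the learned weights (the "reduce" step). The helper names `fit_partition`, `average_models`, and `predict` are hypothetical, the toy data stands in for records on HDFS, and everything runs in one process rather than on a cluster. Weight averaging is an assumption chosen for illustration; it is reasonable for linear models but does not carry over to most other scikit-learn estimators.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def fit_partition(X_part, y_part):
    """Map step: fit a linear model on one partition, return its weights."""
    model = SGDClassifier(random_state=0)
    model.fit(X_part, y_part)
    return model.coef_.ravel(), model.intercept_[0]


def average_models(weights):
    """Reduce step: average the per-partition coefficients and intercepts."""
    coefs, intercepts = zip(*weights)
    return np.mean(coefs, axis=0), float(np.mean(intercepts))


def predict(X, coef, intercept):
    """Classify with the averaged linear decision function."""
    return (X @ coef + intercept > 0).astype(int)


# Toy data standing in for records stored on HDFS (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Four partitions, as if each were a separate input split.
parts = zip(np.array_split(X, 4), np.array_split(y, 4))
coef, intercept = average_models([fit_partition(Xp, yp) for Xp, yp in parts])
accuracy = float(np.mean(predict(X, coef, intercept) == y))
```

On a real cluster, each call to `fit_partition` would run in a separate task over its own split, and only the small weight vectors would be sent to the reducer. If the algorithm needs to see all of the data at once, this kind of split is not available and the answer to the question above is "no".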

Also, you may find this presentation useful (the Hadoop part starts at slide 73).


Look into the JPype module. With JPype you can run Mahout algorithms while writing your code in Python. However, I don't think this is the best solution. If you really need massive scalability, then go with Mahout directly. I use scikit-learn to practice, build proofs of concept, and solve toy problems, but when I need to do clustering and similar work on truly big data, I go with Mahout.