Can someone explain Parallel Python versus Hadoop for distributing a Python process across various servers?


Since moving data gets harder and harder as it grows, data locality becomes very important in parallel computing. Hadoop, as a MapReduce framework, maximizes the locality of the data being processed, and it gives you a way to spread your data efficiently across the cluster (HDFS). So even if you use another parallel module such as Parallel Python, you won't get the maximum benefit of parallel computing unless your data lives on the machines doing the processing; if you have to move data across the cluster all the time, the transfer cost eats into the speedup. That's one of the key ideas behind Hadoop.
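
For contrast, here is a minimal sketch of how Parallel Python farms out a CPU-bound function to remote nodes. It assumes each node is running `ppserver.py`; the hostnames, port, and the `is_prime`/`count_primes` helpers are hypothetical, purely for illustration:

```python
import pp

def is_prime(n):
    # Trial-division primality test: the CPU-bound work we farm out.
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def count_primes(start, end):
    # Count primes in [start, end).
    return sum(1 for n in range(start, end) if is_prime(n))

# Hypothetical remote nodes, each running ppserver.py on the default port.
ppservers = ("192.168.1.10:60000", "192.168.1.11:60000")
job_server = pp.Server(ppservers=ppservers)

# Split the range into chunks and submit one job per chunk; pp ships the
# function and its dependencies to whichever node picks the job up.
chunks = [(i, i + 25000) for i in range(0, 100000, 25000)]
jobs = [job_server.submit(count_primes, chunk, depfuncs=(is_prime,))
        for chunk in chunks]

# Calling a job blocks until its result comes back from the worker.
total = sum(job() for job in jobs)
print("primes found: %d" % total)
job_server.print_stats()
```

Note what is missing here compared to Hadoop: the *code* is shipped to the nodes, but the *data* still has to travel over the network for every job, which is exactly the locality problem described above.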


The main difference is that Hadoop is good at processing big data (from dozens of gigabytes up to terabytes). It provides a simple logical framework called MapReduce, which is well suited to data aggregation, and a distributed storage system called HDFS.
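
To make the MapReduce model concrete, here is the classic word-count example written as two Python scripts for Hadoop Streaming (which runs any stdin/stdout program as a mapper or reducer). The file names `mapper.py` and `reducer.py` are just conventions for this sketch:

```python
#!/usr/bin/env python
# mapper.py -- reads lines from stdin and emits one "word<TAB>1"
# pair per word; Hadoop Streaming feeds input splits to this script.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each word. Hadoop sorts mapper
# output by key, so all lines for a given word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

You would submit these to the cluster with the hadoop-streaming jar, passing `-mapper mapper.py` and `-reducer reducer.py` along with your HDFS input and output paths; Hadoop then runs the mapper next to each data block and handles the sort-and-shuffle between the two stages for you.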

If your inputs are smaller than a gigabyte or so, you probably don't want to use Hadoop: the overhead of starting jobs and distributing the data will outweigh any speedup.