
distributed scheduling system for R scripts


If what you want to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF to get more details. The gist is as follows:

Why write a doRedis package? After all, the foreach package already has available many parallel back end packages, including doMC, doSNOW and doMPI. The doRedis package allows for dynamic pools of workers. New workers may be added at any time, even in the middle of running computations. This feature is relevant, for example, to modern cloud computing environments. Users can make an economic decision to "turn on" more computing resources at any time in order to accelerate running computations. Similarly, modern cluster resource allocation systems can dynamically schedule R workers as cluster resources become available.
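
To make that concrete, here is a minimal sketch of how doRedis plugs into foreach. It assumes a Redis server is reachable on localhost; the queue name "jobs" is just an example, and remote machines would attach to the same queue with redisWorker().

    ## Minimal doRedis sketch -- assumes a Redis server on localhost and the
    ## doRedis + foreach packages installed; the queue name "jobs" is arbitrary.
    library(doRedis)
    library(foreach)

    registerDoRedis("jobs", host = "localhost")   # register the work queue
    startLocalWorkers(n = 2, queue = "jobs")      # spin up two local workers

    ## On a remote machine you would instead run:
    ##   redisWorker("jobs", host = "master-hostname")
    ## New workers can join even while the loop below is running.

    results <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

    removeQueue("jobs")                           # clean up the queue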

Hadoop works best if the machines running it are dedicated to the cluster rather than borrowed. There's also considerable overhead to setting up Hadoop, which can be worth the effort if you need the map/reduce algorithm and distributed storage that Hadoop provides.

So what, exactly, is your configuration? Do you have an office full of machines you want to distribute R jobs across? Do you have a dedicated cluster? Is this going to be EC2 or some other "cloud"-based setup?

The devil is in the details, so you can get better answers if the details are explicit.

If you want the workers to do jobs and have the results of those jobs collected back on one master node, you'll be much better off using a dedicated R solution rather than a system like TakTuk or dsh, which are more general parallelization tools.
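
For example, here is a sketch of one such dedicated R solution using the base parallel package; the host names "node1" and "node2" are hypothetical placeholders for machines reachable by passwordless SSH. The point is that each job's result flows back to the master automatically.

    ## Sketch using base R's "parallel" package; "node1"/"node2" are
    ## hypothetical host names reachable via passwordless SSH.
    library(parallel)

    cl <- makePSOCKcluster(c("node1", "node2"))   # start R workers over SSH

    ## Each job runs on a worker; the results list comes back to the master.
    results <- parLapply(cl, 1:100, function(i) sqrt(i))

    stopCluster(cl)                               # shut the workers down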


Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.