Is it possible to use a Conda environment as a "virtualenv" for a Hadoop Streaming job (in Python)?



I don't know of a way to package a conda environment in a tar/zip, untar it on a different box, and have it ready to use, as in the example you mention; that might not be possible, at least not without Anaconda installed on all the worker nodes. There might also be issues when moving between different operating systems.

Anaconda Cluster was created to solve that problem (disclaimer: I am an Anaconda Cluster developer), but it uses a more complicated approach: basically, we use a configuration management system (Salt) to install Anaconda on all the nodes in the cluster and to control the conda environments.

We use a configuration management system because we also deploy the Hadoop stack (Spark and its friends) and we need to target big clusters. In reality, if you only need to deploy Anaconda and don't have too many nodes, you should be able to do that with just Fabric (which Anaconda Cluster also uses in some parts) and run it from a regular laptop.

If you are interested Anaconda Cluster docs are here: http://continuumio.github.io/conda-cluster/


Update 2019:

The answer is yes, and the way to do it is with conda-pack:

https://conda.github.io/conda-pack/
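A minimal sketch of that workflow, following the pattern shown in the conda-pack docs: pack the environment into a relocatable archive, then ship it with the streaming job via `-archives` so Hadoop unpacks it on each worker node. The environment name `example`, the script names `mapper.py`/`reducer.py`, the HDFS paths, and the exact location of the streaming jar are all placeholders you would adapt to your setup.

```shell
# Pack the local conda environment "example" into a relocatable tarball.
# (Assumes conda-pack is installed, e.g. via: conda install -c conda-forge conda-pack)
conda pack -n example -o environment.tar.gz

# Submit the streaming job. Hadoop distributes the archive to every node
# and unpacks it under the alias after '#' ("environment"), so the mapper
# and reducer can invoke the packed interpreter directly.
hadoop jar /path/to/hadoop-streaming.jar \
  -archives environment.tar.gz#environment \
  -files mapper.py,reducer.py \
  -mapper "environment/bin/python mapper.py" \
  -reducer "environment/bin/python reducer.py" \
  -input /user/me/input \
  -output /user/me/output
```

Because the tarball bundles the Python interpreter and all dependencies, the worker nodes don't need Anaconda (or even Python) preinstalled; they only need to share the OS/architecture the environment was packed on.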