
Shipping and using virtualenv in a pyspark job


With --conf spark.pyspark.{} and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (the executors), use the following syntax:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
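A minimal sketch of how the two settings might look together (the interpreter paths and my_job.py are placeholders, not from your actual setup):

    # driver side: interpreter on the machine you submit from
    export PYSPARK_PYTHON=/usr/local/bin/python2.7

    # executor / application-master side: set on the cluster via spark-submit
    spark-submit \
        --master yarn \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python2.7 \
        my_job.py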

Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
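For example (the path is a placeholder, and the flag is the experimental one mentioned above):

    # rewrite the virtualenv's scripts to use relative instead of absolute paths
    virtualenv --relocatable /path/to/my_venv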

What we did in such cases: we shipped an entire anaconda distribution over hdfs.
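Roughly, that amounts to putting a zipped distribution somewhere every node can read it (the archive name and HDFS path are just examples):

    # upload the zipped Anaconda distribution to HDFS so YARN can localize it on each node
    hdfs dfs -put anaconda.zip /user/me/anaconda.zip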

<edit 20170908>

If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create yourself a 'portable' Anaconda, i.e. install Anaconda in a Linux VM and zip it.
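As a sketch, inside the Linux VM (assuming Anaconda was installed to /opt/anaconda2; adjust to your install location):

    # zip the whole installation so it can be shipped as a single archive
    cd /opt && zip -r ~/anaconda.zip anaconda2/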

Regarding --archives vs. --py-files:

  • --py-files adds python files/packages to the python path. From the spark-submit documentation:

    For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

  • --archives means the archives are extracted into the working directory of each executor (YARN clusters only).

However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.

In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
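Putting it together, a spark-submit call might look roughly like this (the archive alias ANACONDA, the HDFS path, deps.zip and my_job.py are all assumptions; the interpreter path must match the layout inside your zip):

    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --archives hdfs:///user/me/anaconda.zip#ANACONDA \
        --py-files deps.zip \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ANACONDA/anaconda2/bin/python \
        my_job.py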

</edit>

See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.