Shipping and using virtualenv in a pyspark job

numpy pyspark virtualenv

With --conf spark.pyspark.{} and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (executors) use the following syntax:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON

Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>

What we did in such cases: we shipped an entire anaconda distribution over hdfs.

<edit 20170908>

If we are talking about different environments (MacOs vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create yourself a 'portable' anaconda, i.e. install Anaconda in a Linux VM and zip it.

Regarding --archives vs. --py-files:

--py-files adds python files/packages to the python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (only yarn clusters).

However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.

In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.

</edit>

See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.

CodeHunter

Shipping and using virtualenv in a pyspark job

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last