Pyspark --py-files doesn't work


Try this method of SparkContext:

sc.addPyFile(path)

According to the PySpark documentation:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method.

Here is more comprehensive reading material: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html
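As a rough sketch of how that fits together (the bucket URL, the mylibs.zip archive, the input path, and the parser module below are placeholders for whatever you actually ship):

from pyspark import SparkContext

sc = SparkContext(appName="py-files-example")

# Ship the dependency archive to every executor. A local path, an HDFS path,
# or an HTTP/HTTPS/FTP URI are all accepted by addPyFile.
sc.addPyFile("https://my-bucket.s3.amazonaws.com/deps/mylibs.zip")

def parse_record(record):
    # Import inside the task so the module is resolved on the executor,
    # where addPyFile has already placed it on the Python path.
    import parser  # custom module contained in mylibs.zip
    return parser.parse(record)

parsed = sc.textFile("hdfs:///data/records.txt").map(parse_record)
print(parsed.take(5))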


Try to import your custom module from inside the method itself rather than at the top of the driver script, e.g.:

def parse_record(record):
    import parser
    p = parser.parse(record)
    return p

rather than

import parser

def parse_record(record):
    p = parser.parse(record)
    return p

CloudPickle doesn't seem to recognise when a custom module has been imported, so it tries to pickle the top-level modules along with the other data needed to run the method. In my experience, this means that top-level modules appear to exist on the executors, but they lack usable members, and nested modules can't be used as expected. Once I switched to either importing with from A import * at the top level or importing from inside the method (import A.B), the modules worked as expected.
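To make the distinction concrete, here is a sketch of the two patterns that worked, assuming a live SparkContext sc and a custom package A with a submodule A.B that has already been shipped to the executors via --py-files or addPyFile:

# Pattern 1: import inside the function; cloudpickle serialises only the
# function body, and the import is resolved on the executor at call time.
def parse_record(record):
    import A.B
    return A.B.parse(record)

# Pattern 2: star-import at the top of the driver script, so the needed
# names are bound directly in the driver's namespace.
from A import *  # brings parse() into scope

def parse_record_v2(record):
    return parse(record)

parsed = sc.textFile("hdfs:///data/records.txt").map(parse_record)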


It sounds like one or more of the nodes aren't configured properly. Do all of the nodes on the cluster have the same version/configuration of Python (i.e. do they all have the parser module installed)?

If you don't want to check node by node, you could write a script that checks whether the module is installed (and installs it for you if it isn't). This thread shows a few ways to do that, and the sketch below is one rough approach.
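For example, you can run a small Spark job that attempts the import on the executors and reports the hostnames. This is only a sketch: it assumes a live SparkContext sc, uses "parser" as the module you expect to be installed, and oversamples partitions in the hope of touching every worker, which isn't strictly guaranteed.

import socket

def probe(_):
    # Try importing the dependency on whichever executor runs this task.
    try:
        import parser  # the module you expect to be installed everywhere
        ok = True
    except ImportError:
        ok = False
    return (socket.gethostname(), ok)

report = (sc.parallelize(range(200), 200)  # many small tasks, spread across workers
            .map(probe)
            .distinct()
            .collect())

for host, ok in sorted(report):
    print(host, "OK" if ok else "MISSING parser module")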