Pyspark --py-files doesn't work


Try this method of SparkContext:

sc.addPyFile(path)

According to the PySpark documentation:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method.

Here is more comprehensive reading material: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html
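As a rough sketch of how that fits together (the bucket URL, the mylibs.zip archive, the input path, and the parser module below are placeholders for whatever you actually ship):

from pyspark import SparkContext

sc = SparkContext(appName="py-files-example")

# Ship the dependency archive to every executor. A local path, an HDFS path,
# or an HTTP/HTTPS/FTP URI are all accepted by addPyFile.
sc.addPyFile("https://my-bucket.s3.amazonaws.com/deps/mylibs.zip")

def parse_record(record):
    # Import inside the task so the module is resolved on the executor,
    # where addPyFile has already placed it on the Python path.
    import parser  # custom module contained in mylibs.zip
    return parser.parse(record)

parsed = sc.textFile("hdfs:///data/records.txt").map(parse_record)
print(parsed.take(5))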


Try to import your custom module from inside the method itself rather than at the top of the driver script, e.g.:

def parse_record(record):
    import parser
    p = parser.parse(record)
    return p

rather than

import parser

def parse_record(record):
    p = parser.parse(record)
    return p

CloudPickle doesn't seem to recognise when a custom module has been imported, so it tries to pickle the top-level modules along with the other data needed to run the method. In my experience, this means that top-level modules appear to exist on the executors, but they lack usable members, and nested modules can't be used as expected. Once I switched to either importing with from A import * at the top level or importing from inside the method (import A.B), the modules worked as expected.
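To make the distinction concrete, here is a sketch of the two patterns that worked, assuming a live SparkContext sc and a custom package A with a submodule A.B that has already been shipped to the executors via --py-files or addPyFile:

# Pattern 1: import inside the function; cloudpickle serialises only the
# function body, and the import is resolved on the executor at call time.
def parse_record(record):
    import A.B
    return A.B.parse(record)

# Pattern 2: star-import at the top of the driver script, so the needed
# names are bound directly in the driver's namespace.
from A import *  # brings parse() into scope

def parse_record_v2(record):
    return parse(record)

parsed = sc.textFile("hdfs:///data/records.txt").map(parse_record)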


It sounds like one or more of the nodes aren't configured properly. Do all of the nodes on the cluster have the same version/configuration of Python (i.e. do they all have the parser module installed)?

If you don't want to check node by node, you could write a script that checks whether the module is installed (and installs it for you if it isn't). This thread shows a few ways to do that, and the sketch below is one rough approach.
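For example, you can run a small Spark job that attempts the import on the executors and reports the hostnames. This is only a sketch: it assumes a live SparkContext sc, uses "parser" as the module you expect to be installed, and oversamples partitions in the hope of touching every worker, which isn't strictly guaranteed.

import socket

def probe(_):
    # Try importing the dependency on whichever executor runs this task.
    try:
        import parser  # the module you expect to be installed everywhere
        ok = True
    except ImportError:
        ok = False
    return (socket.gethostname(), ok)

report = (sc.parallelize(range(200), 200)  # many small tasks, spread across workers
            .map(probe)
            .distinct()
            .collect())

for host, ok in sorted(report):
    print(host, "OK" if ok else "MISSING parser module")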