How does Spark interoperate with CPython?


The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

PySpark internals

As @Holden said, Spark uses py4j to access Java objects in the JVM from Python. But that is only one case - when the driver program is written in Python (the left part of the diagram on that page).
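
For that first case, a minimal sketch looks like this (sc._jvm is an internal attribute of SparkContext, used here only to make the py4j bridge visible):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "py4j-demo")

    # This call never touches Python's standard library: it is forwarded over
    # the py4j gateway to java.lang.System inside the JVM, and the result is
    # sent back over the socket to the Python driver.
    print(sc._jvm.java.lang.System.currentTimeMillis())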

The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects to be processed, and reads the output back. The Java objects are serialized into pickle format so that Python can read them.
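
That second path is what any ordinary RDD transformation written in Python goes through. A rough sketch (os.getpid() is only there to make the separate worker processes visible):

    import os
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(8), 4)

    # square() never runs inside the JVM: each executor starts a Python worker
    # process, streams the partition elements to it in pickle format, and reads
    # the pickled results back.
    def square(x):
        return (os.getpid(), x * x)

    print(rdd.map(square).collect())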

It sounds like the latter case is what you are looking for. Here are some links to Spark's Scala core that could be useful to get started:

  • The Pyrolite library, which provides a Java interface to Python's pickle protocol - used by Spark to serialize Java objects into pickle format. For example, such a conversion is needed to access the key part of key/value pairs in a PairRDD.

  • The Scala code that starts the Python process and interacts with it: api/python/PythonRDD.scala

  • The SerDe utilities that handle the pickling: api/python/SerDeUtil.scala

  • The Python side of the exchange: python/pyspark/worker.py (a simplified sketch of its role follows below)
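
Very roughly, the contract those pieces implement looks like the toy sketch below. This is only a conceptual stand-in, not the actual protocol: the real worker.py talks to the JVM over a length-prefixed socket stream, batches elements, serializes the user function with cloudpickle, and also handles broadcasts, accumulators and error reporting.

    import pickle

    def square(x):
        return x * x

    def toy_worker(serialized_func, pickled_inputs):
        # Stand-in for the Python worker loop: unpickle the user function,
        # apply it to each unpickled element, and pickle the results so the
        # JVM side can read them back.
        func = pickle.loads(serialized_func)
        return [pickle.dumps(func(pickle.loads(item))) for item in pickled_inputs]

    # "JVM side": pickle the function and the partition elements, hand them over...
    outputs = toy_worker(pickle.dumps(square), [pickle.dumps(x) for x in range(5)])
    # ...and unpickle whatever the "worker" sent back.
    print([pickle.loads(o) for o in outputs])   # [0, 1, 4, 9, 16]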


So Spark uses py4j to communicate between the JVM and Python. This allows Spark to work with different versions of Python, but it means data has to be serialized whenever it crosses between the JVM and Python in either direction. There is more info on py4j at http://py4j.sourceforge.net/ - hope that helps :)
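
For a feel of what py4j itself does outside of Spark, here is a minimal sketch; it assumes a JVM with a py4j GatewayServer is already listening on the default port (inside PySpark, the SparkContext starts and wires up that gateway for you):

    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()                    # connect to the GatewayServer in the JVM
    random = gateway.jvm.java.util.Random()    # the object is created inside the JVM
    print(random.nextInt(100))                 # each call crosses the Python <-> JVM socket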