Is it possible to configure the Beam portable runner with Spark configurations?
There are three solutions to choose from, depending on your deployment requirements. In order of difficulty:
- Use the Spark "uber jar" job server. This starts an embedded job server inside the Spark master, instead of using a standalone job server in a container. This simplifies your deployment a lot, since you would not need to start the `beam_spark_job_server` container at all:

```
python -m apache_beam.examples.wordcount \
  --output ./data_test/ \
  --runner=SparkRunner \
  --spark_submit_uber_jar \
  --spark_master_url=spark://spark-master:7077 \
  --environment_type=LOOPBACK
```
- You can pass the properties through a Spark configuration file. Create the Spark configuration file and add `spark.driver.host` and whatever other properties you need. In the `docker run` command for the job server, mount that configuration file to the container, and set the `SPARK_CONF_DIR` environment variable to point to that directory.

- If neither of those works for you, you can alternatively build your own customized version of the job server container. Pull the Beam source from GitHub. Check out the release branch you want to use (e.g.
`git checkout origin/release-2.28.0`). Modify the entrypoint `spark-job-server.sh` and set `-Dspark.driver.host=x` there. Then build the container using `./gradlew :runners:spark:job-server:container:docker -Pdocker-repository-root="your-repo" -Pdocker-tag="your-tag"`.
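As a minimal sketch of the configuration-file approach (the second option above): the property value, local directory, container mount path, and image tag below are illustrative assumptions, not values from the original answer.

```shell
# Create a Spark configuration directory with a spark-defaults.conf.
# "spark-job-server" is a placeholder hostname -- use whatever name
# your workers can resolve.
mkdir -p ./spark-conf
cat > ./spark-conf/spark-defaults.conf <<'EOF'
spark.driver.host  spark-job-server
EOF

# Then mount it into the job server container and point SPARK_CONF_DIR
# at it (mount path and image tag are assumptions):
#
# docker run \
#   -v "$PWD/spark-conf:/opt/spark/conf" \
#   -e SPARK_CONF_DIR=/opt/spark/conf \
#   apache/beam_spark_job_server:2.28.0
```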
Let me revise the answer. The job server needs to be able to communicate with the workers, and vice versa; the containers keep exiting because that communication fails. You need to configure the deployment so that they can reach each other. A Kubernetes headless service can solve this.
A working example is available at https://github.com/cometta/python-apache-beam-spark . If it is useful for you, please 'Star' the repository.
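For reference, a headless service is one with `clusterIP: None`, so DNS resolves the service name directly to the pod IPs and workers can connect back to the job server. This sketch is an assumption about what such a manifest could look like, not a file from the repository above; the names and selector are placeholders, while the ports are Beam's default job (8099), artifact (8098), and expansion (8097) service ports.

```yaml
# Hypothetical headless Service for the Beam Spark job server.
apiVersion: v1
kind: Service
metadata:
  name: beam-spark-job-server   # placeholder name
spec:
  clusterIP: None               # headless: DNS returns pod IPs directly
  selector:
    app: beam-spark-job-server  # must match your job server pod labels
  ports:
    - name: job-service
      port: 8099
    - name: artifact-service
      port: 8098
    - name: expansion-service
      port: 8097
```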