Re-run Spark jobs on Failure or Abort
Just a thought!
Let us call the script file (containing the above script) run_spark_job.sh.
Try adding these statements at the end of the script:
```shell
return_code=$?

if [[ ${return_code} -ne 0 ]]; then
    echo "Job failed"
    exit ${return_code}
fi

echo "Job succeeded"
exit 0
```
Let us have another script file, spark_job_runner.sh, from which we call the above script. For example:
```shell
./run_spark_job.sh

while [ $? -ne 0 ]; do
    ./run_spark_job.sh
done
```
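The loop above retries forever. If you want to cap the retries, a small variation could look like the sketch below; `MAX_ATTEMPTS` and the `retry_job` helper are my own additions, not part of the original scripts:

```shell
#!/usr/bin/env bash
# Re-run a command up to MAX_ATTEMPTS times instead of looping forever.
MAX_ATTEMPTS=3

retry_job() {
    local attempt=1
    while [ "${attempt}" -le "${MAX_ATTEMPTS}" ]; do
        "$@" && return 0                      # stop on the first success
        echo "Attempt ${attempt} failed, retrying..." >&2
        attempt=$((attempt + 1))
    done
    echo "Giving up after ${MAX_ATTEMPTS} attempts" >&2
    return 1
}

# Usage: retry_job ./run_spark_job.sh
```

This keeps the retry policy in one place, and the exit status of `retry_job` tells the caller whether any attempt ultimately succeeded.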
YARN-based approaches:

Update 1: This link will be a good read. It discusses using the YARN REST API to submit and track applications: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
Update 2: This link shows how to submit a Spark application to a YARN environment using Java: https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md
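As a rough illustration of the REST flow the linked article describes (the ResourceManager address below is a placeholder, and the JSON request body is omitted; see the article for the full submission spec):

```shell
# Sketch of the YARN ResourceManager REST submission flow.
# RM_HOST is a placeholder for your ResourceManager web address.
RM_HOST="http://resourcemanager:8088"

# 1. Request a new application ID:
#    curl -X POST "${RM_HOST}/ws/v1/cluster/apps/new-application"

# 2. Submit the application with a JSON body describing the AM container
#    (am-container-spec, resource, etc. -- see the linked article):
#    curl -X POST -H "Content-Type: application/json" \
#         -d @app-submission.json "${RM_HOST}/ws/v1/cluster/apps"

# 3. Poll the application state (APP_ID comes from step 1):
#    curl "${RM_HOST}/ws/v1/cluster/apps/${APP_ID}/state"
```

Because submission goes through plain HTTP, a retry wrapper around these calls can resubmit the job when the reported state is FAILED or KILLED.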
Spark-based programmatic approach:
How to use the programmatic spark submit capability
Spark-based configuration approach for YARN:

The only Spark parameter in YARN mode for restarting is spark.yarn.maxAppAttempts, and it should not exceed the YARN ResourceManager parameter yarn.resourcemanager.am.max-attempts.
Excerpt from the official documentation (https://spark.apache.org/docs/latest/running-on-yarn.html):
The maximum number of attempts that will be made to submit the application.
In YARN mode you can set yarn.resourcemanager.am.max-attempts (default: 2) to re-run a failed job, and you can increase it as much as you want. Alternatively, you can use Spark's spark.yarn.maxAppAttempts configuration for the same purpose.
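Putting that together, the attempt limit can be passed straight on the spark-submit command line. A sketch (the class name and jar path are placeholders, not from the original post):

```shell
# Let YARN re-attempt the application up to 4 times on failure.
# spark.yarn.maxAppAttempts must not exceed the cluster's
# yarn.resourcemanager.am.max-attempts setting.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --class com.example.MySparkApp \
    /path/to/my-spark-app.jar
```

Note that this only retries the whole application when the ApplicationMaster fails; it does not help if the driver exits cleanly with a logical error, which is where the shell-loop approach above still applies.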