
Re-run Spark jobs on Failure or Abort


Just a thought!

Let us call the script file (containing the above script) run_spark_job.sh.

Try adding these statements at the end of the script:

return_code=$?
if [[ ${return_code} -ne 0 ]]; then
    echo "Job failed"
    exit ${return_code}
fi
echo "Job succeeded"
exit 0
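For context, a minimal run_spark_job.sh might look like the sketch below. The main class, JAR path, and spark-submit options are hypothetical placeholders; substitute whatever your job actually uses.

#!/usr/bin/env bash
# run_spark_job.sh -- minimal sketch; the class, JAR path, and options are placeholders.

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MySparkJob \
    /path/to/my-spark-job.jar "$@"

# Capture spark-submit's exit status immediately, before anything else can overwrite $?.
return_code=$?
if [[ ${return_code} -ne 0 ]]; then
    echo "Job failed"
    exit ${return_code}
fi
echo "Job succeeded"
exit 0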

Let us have another script file, spark_job_runner.sh, from which we call the above script. For example,

./run_spark_job.sh
while [ $? -ne 0 ]; do
    ./run_spark_job.sh
done
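The loop above retries forever. If you would rather cap the number of retries and pause between attempts, a sketch like the following works; the max_retries value and the sleep interval are arbitrary choices, so adjust them for your environment.

#!/usr/bin/env bash
# spark_job_runner.sh -- bounded-retry variant (sketch); max_retries and the sleep are arbitrary.

max_retries=3
attempt=1

./run_spark_job.sh
while [ $? -ne 0 ]; do
    if [ ${attempt} -ge ${max_retries} ]; then
        echo "Giving up after ${max_retries} attempts"
        exit 1
    fi
    attempt=$((attempt + 1))
    echo "Retrying (attempt ${attempt} of ${max_retries})..."
    sleep 60
    ./run_spark_job.sh
done
echo "Job succeeded"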

YARN-based approaches:

Update 1: This link is a good read; it discusses using the YARN REST API to submit and track Spark jobs: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
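As a rough illustration of the REST workflow from that article, the calls below request a new application ID from the ResourceManager and poll an application's state. The host/port and application ID are placeholders, and the full submission payload (a JSON application-submission-context) is spelled out in the linked article.

# Sketch of the YARN ResourceManager REST calls; rm-host:8088 and the application ID are placeholders.

# 1. Ask the ResourceManager for a new application ID (and max resource capabilities).
curl -s -X POST "http://rm-host:8088/ws/v1/cluster/apps/new-application"

# 2. Submit the application by POSTing a JSON application-submission-context to /ws/v1/cluster/apps
#    (the payload itself is described in the linked article).

# 3. Poll the application's state until it reaches FINISHED, FAILED, or KILLED.
curl -s "http://rm-host:8088/ws/v1/cluster/apps/application_1234567890123_0001/state"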

Update 2: This link shows how to submit a Spark application to a YARN environment using Java: https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md

Spark-based programmatic approach:

How to use the programmatic spark submit capability

Spark-based configuration approach for YARN:

The only Spark parameter in YARN mode for restarting is spark.yarn.maxAppAttempts, and it should not exceed the YARN ResourceManager parameter yarn.resourcemanager.am.max-attempts.

Excerpt from the official documentation (https://spark.apache.org/docs/latest/running-on-yarn.html):

The maximum number of attempts that will be made to submit the application.
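Concretely, the property can be passed on the command line. In the sketch below the value 4, the class, and the JAR path are just examples; the setting only takes effect if yarn.resourcemanager.am.max-attempts on the cluster is at least as large.

# Sketch: passing the retry setting to spark-submit (value, class, and JAR are placeholders).
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --class com.example.MySparkJob \
    /path/to/my-spark-job.jar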


In YARN mode you can set yarn.resourcemanager.am.max-attempts (default 2) to re-run a failed job, and you can increase it as many times as you want. Or you can use Spark's spark.yarn.maxAppAttempts configuration for the same purpose.
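To check whether YARN actually re-attempted the application master, you can list the attempts for a given application; the application ID below is a placeholder.

# List the application master attempts for an application (the ID is a placeholder).
yarn applicationattempt -list application_1234567890123_0001

# Show the overall application status, including its final state.
yarn application -status application_1234567890123_0001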