Do exit codes and exit statuses mean anything in spark?

hadoop apache-spark pyspark spark-dataframe hadoop-yarn

Neither exit codes (also called exit statuses) nor signals are Spark-specific; they are part of the way processes work on Unix-like systems.

Exit status and exit code

Exit status and exit code are different names for the same thing. An exit status is a number between 0 and 255 that indicates the outcome of a process after it has terminated. Exit status 0 usually indicates success. The meaning of the other codes is program-dependent and should be described in the program's documentation. There are some established standard codes, though. See this answer for a comprehensive list.
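To make this concrete, here is a minimal Python sketch that starts a child process and reads the exit status the parent observes. It is only an illustration: the helper name run_and_report is made up for this example and has nothing to do with Spark.

```python
import subprocess
import sys


def run_and_report(exit_with: int) -> int:
    """Start a child Python process that exits with the given status
    and return the status the parent process observes."""
    child = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({exit_with})"]
    )
    return child.returncode


if __name__ == "__main__":
    for status in (0, 3):
        print(f"child exited with status {run_and_report(status)}")
```

In a shell, the same value is available as $? right after the command has finished.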

Exit codes used by Spark

In the Spark sources I found the following exit codes. Their descriptions are taken from log statements and comments in the code, and from my understanding of the code where the exit status appeared.

Spark SQL CLI Driver in Hive Thrift Server

3: if an UnsupportedEncodingException occurred when setting up stdout and stderr streams.

Spark/Yarn

10: if an uncaught exception occurred
11: if more than spark.yarn.scheduler.reporterThread.maxFailures executor failures occurred
12: if the reporter thread failed with an exception
13: if the program terminated before the user had initialized the spark context or if the spark context did not initialize before a timeout.
14: This is declared as EXIT_SECURITY but never used
15: if a user class threw an exception
16: if the shutdown hook was called before the final status was reported. A comment in the source code explains the expected behaviour of user applications:
The default state of ApplicationMaster is failed if it is invoked by shut down hook. This behavior is different compared to 1.x version. If user application is exited ahead of time by calling System.exit(N), here mark this application as failed with EXIT_EARLY. For a good shutdown, user shouldn't call System.exit(0) to terminate the application.

Executors

50: The default uncaught exception handler was reached
51: The default uncaught exception handler was called and an exception was encountered while logging the exception
52: The default uncaught exception handler was reached, and the uncaught exception was an OutOfMemoryError
53: DiskStore failed to create local temporary directory after many attempts (bad spark.local.dir?)
54: ExternalBlockStore failed to initialize after many attempts
55: ExternalBlockStore failed to create a local temporary directory after many attempts
56: Executor is unable to send heartbeats to the driver more than "spark.executor.heartbeat.maxFailures" times.
101: Returned by spark-submit if the child main class was not found. In client mode (command line option --deploy-mode client) the child main class is the user submitted application class (--class CLASS). In cluster mode (--deploy-mode cluster) the child main class is the cluster manager specific submission/client class.
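For log triage, the codes listed so far can be collected into a small lookup table. The following Python sketch does that; the names SPARK_EXIT_CODES and describe_exit_code are hypothetical, the descriptions are paraphrased from the lists above, and none of this is an official Spark API.

```python
# Descriptions paraphrased from the lists above; not an official Spark API.
SPARK_EXIT_CODES = {
    3: "Spark SQL CLI: UnsupportedEncodingException while setting up stdout/stderr",
    10: "YARN: uncaught exception",
    11: "YARN: too many executor failures",
    12: "YARN: reporter thread failed with an exception",
    13: "YARN: Spark context not initialized (early exit or timeout)",
    14: "YARN: EXIT_SECURITY (declared but never used)",
    15: "YARN: user class threw an exception",
    16: "YARN: shutdown hook called before final status was reported",
    50: "Executor: default uncaught exception handler reached",
    51: "Executor: exception while logging an uncaught exception",
    52: "Executor: uncaught OutOfMemoryError",
    53: "Executor: DiskStore could not create a local temporary directory",
    54: "Executor: ExternalBlockStore failed to initialize",
    55: "Executor: ExternalBlockStore could not create a local temporary directory",
    56: "Executor: too many failed heartbeats to the driver",
    101: "spark-submit: child main class not found",
}

UNKNOWN = "not in the list; check the logs of the component that exited"


def describe_exit_code(code: int) -> str:
    """Return a short description for a known code, or a generic hint."""
    return SPARK_EXIT_CODES.get(code, UNKNOWN)


if __name__ == "__main__":
    for code in (13, 52, 101, 7):
        print(code, "->", describe_exit_code(code))
```

The table reflects only what is listed in this answer, so it may be incomplete or outdated for other Spark versions.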

Exit codes greater than 128

These exit codes most likely result from a program shutdown triggered by a Unix signal. The signal number can be calculated by subtracting 128 from the exit code. This is explained in more detail in this blog post (which was originally linked in this question). There is also a good answer explaining JVM-generated exit codes. Spark works with this assumption, as explained in a comment in ExecutorExitCodes.scala.
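The subtraction can be sketched in a few lines of Python using the standard signal module. The function name signal_from_exit_code is made up for this example, and the mapping from numbers to names depends on the platform the code runs on.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the name of the signal implied by an exit code above 128,
    or None if the code does not map to a signal on this platform."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # No signal with that number exists on this platform.
        return None


if __name__ == "__main__":
    for code in (137, 143, 1):
        print(code, "->", signal_from_exit_code(code))
```

On Linux, 137 maps to SIGKILL (9) and 143 to SIGTERM (15), two values that show up frequently in container logs.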

Other exit codes

Apart from the exit codes listed above, there are a number of System.exit() calls in the Spark sources that set 1 or -1 as the exit code. As far as I can tell, -1 seems to be used to indicate missing or incorrect command line parameters, while 1 indicates all other errors.
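One practical consequence, added here as a general Unix note rather than something taken from the Spark sources: the parent process only sees the low 8 bits of the status, so an exit code of -1 is typically reported as 255. A tiny Python sketch of that arithmetic, with the hypothetical helper name observed_status:

```python
def observed_status(requested: int) -> int:
    """Status a parent process on a Unix-like system would observe
    for a requested exit code (only the low 8 bits survive)."""
    return requested & 0xFF


if __name__ == "__main__":
    for requested in (-1, 1, 256):
        print(requested, "->", observed_status(requested))
```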

Signals

Signals are a kind of event that allows system messages to be sent to a process. These messages are used, for instance, to ask a process to reload its configuration (SIGHUP), to ask it to terminate (SIGTERM), or to kill it outright (SIGKILL, which the process cannot catch or ignore). A list of standard signals can be found in the signal(7) man page in the section Standard Signals.

As explained by Rick Moritz in the comments below (thank you!), the most likely sources of signals in a Spark setup are

the cluster resource manager: when the container size was exceeded, the job finished, a dynamic scale-down was made, or a job was aborted by the user
the operating system: as part of a controlled system shutdown or if some resource limit was hit (out of memory, over hard quota, no space left on disk etc.)
a local user who killed a job
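What such a kill looks like from the parent's side can be tried out with a small Python sketch. It assumes a POSIX system; the helper name terminate_child is made up, and the sleeping child merely stands in for a real job.

```python
import signal
import subprocess
import sys


def terminate_child(sig: int = signal.SIGTERM) -> int:
    """Start a sleeping child process, send it a signal, and return
    the return code that Python's subprocess module reports."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)
    return child.wait(timeout=10)


if __name__ == "__main__":
    code = terminate_child()
    # subprocess reports "killed by signal N" as -N on POSIX,
    # whereas a shell would report the same event as 128 + N.
    print("return code:", code, "shell equivalent:", 128 - code)
```

Python's subprocess module reports a signal death as a negative number, while shells and most cluster logs use the 128 + N convention described earlier.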

I hope this makes it a bit clearer what these messages from Spark might mean.
