What is the correct way to start/stop spark streaming jobs in yarn?

hadoop apache-spark spark-streaming hadoop-yarn cloudera

You can close the spark-submit console. Once it reports the RUNNING state, the job is already running in the background.
Aggregated logs become visible only after the application completes. While the job is running, all logs are available locally on the worker nodes (you can reach them through the YARN ResourceManager web UI), and they are aggregated to HDFS after the job finishes.
yarn application -kill is probably the best way to stop a Spark Streaming application, but it's not perfect. It would be better to shut down gracefully, stopping all stream receivers and the streaming context, but I personally don't know how to do that.
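As a rough illustration of the kill-based approach described above, the following sketch looks up a running application by name and kills it. The function names and the example application name are hypothetical; the sketch assumes the `yarn` CLI is on the PATH and that the name appears on the application's line in the `yarn application -list` output. This is a hard kill, not a graceful stop.

```shell
#!/bin/bash
# Sketch only: function names and the application name are hypothetical.

# Read `yarn application -list` output on stdin and print the ID of the
# first application whose line contains the given name.
extract_app_id() {
  local name="$1"
  grep -F -- "$name" | grep -oE 'application_[0-9]+_[0-9]+' | head -n 1
}

# Kill the running application with the given name (not a graceful stop).
kill_app_by_name() {
  local id
  id=$(yarn application -list -appStates RUNNING 2>/dev/null | extract_app_id "$1")
  if [ -z "$id" ]; then
    echo "no running application matches '$1'" >&2
    return 1
  fi
  yarn application -kill "$id"
  # Aggregated logs can be fetched once the application has finished.
  echo "fetch logs with: yarn logs -applicationId $id"
}
```

After sourcing the file, something like `kill_app_by_name MyStreamingJob` would kill the matching job. Keeping the parsing in its own function makes it easy to check against saved CLI output.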


I finally figured out a way to stop a Spark Streaming job safely.

Write a socket server thread that waits for a command to stop the streaming context:

    package xxx.xxx.xxx

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.{ServerSocket, Socket}

    import org.apache.spark.streaming.StreamingContext

    object KillServer {

      // Listens on `port` and hands each incoming connection to a Handler.
      class NetworkService(port: Int, ssc: StreamingContext) extends Runnable {
        val serverSocket = new ServerSocket(port)

        def run(): Unit = {
          Thread.currentThread().setName("Zhuangdy | Waiting for graceful stop at port " + port)
          while (true) {
            val socket = serverSocket.accept()
            (new Handler(socket, ssc)).run()
          }
        }
      }

      // Reads one line from the connection and stops the context if it is "kill".
      class Handler(socket: Socket, ssc: StreamingContext) extends Runnable {
        def run(): Unit = {
          val reader = new InputStreamReader(socket.getInputStream)
          val br = new BufferedReader(reader)
          if (br.readLine() == "kill") {
            // stopSparkContext = true, stopGracefully = true
            ssc.stop(true, true)
          }
          br.close()
        }
      }

      def run(port: Int, ssc: StreamingContext): Unit = {
        (new NetworkService(port, ssc)).run()
      }
    }

In the main method where you start the streaming context, add the following code:
```
ssc.start()
KillServer.run(11212, ssc)
ssc.awaitTermination()
```
Write a spark-submit command that submits the job to YARN and redirects its output to a file, which you will use later:

    spark-submit --class "com.Mainclass" \
        --conf "spark.streaming.stopGracefullyOnShutdown=true" \
        --master yarn-cluster --queue "root" \
        --deploy-mode cluster \
        --executor-cores 4 --num-executors 8 --executor-memory 3G \
        hdfs:///xxx.jar > output 2>&1 &

Finally, shut down the Spark Streaming job safely, without losing data or leaving computed results unpersisted. The server socket that stops the streaming context gracefully runs on the driver, so grep the output file written by the spark-submit command to get the driver address, then use echo and nc to send the kill command over a socket:

    #!/bin/bash
    driver=`cat output | grep ApplicationMaster | grep -Po '\d+\.\d+\.\d+\.\d+'`
    echo "kill" | nc $driver 11212
    driverid=`yarn application -list 2>&1 | grep ad.Stat | grep -Po 'application_\d+_\d+'`
    yarn application -kill $driverid
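One possible refinement of the script above is to give the graceful stop time to finish before falling back to a hard kill. The sketch below polls `yarn application -status` until the application reaches a terminal state. The function names, the 5-second poll interval, and the default timeout are assumptions, and the parser assumes the report contains a line of the form `State : RUNNING`.

```shell
#!/bin/bash
# Sketch only: function names, poll interval, and timeout are assumptions.

# Read a `yarn application -status` report on stdin and print the value of
# its "State" field (for example RUNNING or FINISHED).
parse_app_state() {
  awk -F' : ' '
    { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
    key == "State" { val = $2; gsub(/[ \t\r]+$/, "", val); print val; exit }
  '
}

# Poll until the application reaches a terminal state or the timeout (in
# seconds) expires. Returns 0 if it stopped on its own, 1 otherwise.
wait_for_stop() {
  local app_id="$1" timeout="${2:-300}" waited=0 state
  while [ "$waited" -lt "$timeout" ]; do
    state=$(yarn application -status "$app_id" 2>/dev/null | parse_app_state)
    case "$state" in
      FINISHED|FAILED|KILLED) return 0 ;;
    esac
    sleep 5
    waited=$((waited + 5))
  done
  return 1
}

# Possible use after sending the "kill" command to the driver:
#   wait_for_stop "$driverid" 600 || yarn application -kill "$driverid"
```

With this, the hard kill only runs if the application is still alive after the timeout.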


What is your data source? If it is reliable, like the Kafka direct stream, killing the application through YARN should be fine. When your application restarts, it will resume reading from the offsets of the last completed batch (this assumes the offsets are checkpointed or stored somewhere). If the data source is not reliable, or if you want to handle a graceful shutdown yourself, you have to implement some kind of external hook on the streaming context. I faced the same problem and ended up implementing a small hack that adds a new tab to the web UI, which acts as a stop button.
