Remotely execute a Spark job on an HDInsight cluster

azure apache-spark remote-access azure-hdinsight

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.

https://github.com/spark-jobserver/spark-jobserver

My solution is using both Scheduler and Spark-jobserver to launch the Spark-job periodically.

azure apache-spark remote-access azure-hdinsight

At the moment of this writing, it seems there is no official way of achieving this. So far, however, I have been able to somehow remotely run Spark jobs using an Oozie shell workflow. It is nothing but a patch, but so far it has been useful for me. These are the steps I have followed:

Prerequisites

Microsoft Powershell
Azure Powershell

Process

Define an Oozie workflow .xml file:

<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">  <start to = "myAction"/>  <action name="myAction">        <shell xmlns="uri:oozie:shell-action:0.2">            <job-tracker>${jobTracker}</job-tracker>            <name-node>${nameNode}</name-node>            <configuration>                <property>                    <name>mapred.job.queue.name</name>                    <value>${queueName}</value>                </property>            </configuration>            <exec>myScript.cmd</exec>            <file>wasb://myContainer@myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>            <file>wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>        </shell>        <ok to="end"/>        <error to="fail"/>    </action>    <kill name="fail">        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>    </kill>    <end name="end"/></workflow-app>

Note that it is not possible to identify on which HDInsight node is going to be executed the script, so it is necessary to put it, along with the Spark application .jar, on the wasb repository. It is then redirectioned to the local directory on which the Oozie job is executing.

Define the custom script

C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass                                          --master yarn-cluster                                           --deploy-mode cluster                                           --num-executors 3                                           --executor-memory 2g                                           --executor-cores 4                                           mySpark.jar

It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process that it is not included in this answer), concretely to the direction pointed in the workflow:

wasb://myContainer@myAccount.blob.core.windows.net/

Define the powershell script

The powershell script is very much taken from the official Oozie on HDInsight tutorial. I am not including the script on this answer due to its almost absolute sameness with my approach.

I have made a new suggestion on the azure feedback portal indicating the need of official support for remote Spark job submission.

azure apache-spark remote-access azure-hdinsight

Updated on 8/17/2016:Our spark cluster offering now includes a Livy server that provides a rest service to submit a spark job. You can automate spark job via Azure Data Factory as well.

Original post: 1) Remote job submission for spark is currently not supported.

2) If you want to automate setting a master every time ( i.e. adding --master yarn-client every time you execute), you can set the value in %SPARK_HOME\conf\spark-defaults.conf file with following config:

spark.master yarn-client

You can find more info on spark-defaults.conf on apache spark website.

3) Use cluster customization feature if you want to add this automatically to spark-defaults.conf file at deployment time.

CodeHunter

Remotely execute a Spark job on an HDInsight cluster

Prerequisites

Process

Define an Oozie workflow .xml file:

Define the custom script

Define the powershell script

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last

Remotely execute a Spark job on an HDInsight cluster

Prerequisites

Process

Define an Oozie workflow *.xml* file:

Define the custom script

Define the powershell script

Recent Posts

Define an Oozie workflow .xml file: