How to submit Apache Spark job to Hadoop YARN on Azure HDInsight



You can install Spark on an HDInsight cluster. You do this by creating a custom cluster and adding a Script Action that installs Spark on the nodes while the cluster's VMs are being provisioned.

Installing with a Script Action at cluster-create time is straightforward: in C# or PowerShell you add a few lines of code to a standard custom-create cluster script/program.

PowerShell:

    # Add the Script Action to the cluster configuration
    $config = Add-AzureHDInsightScriptAction -Config $config `
        -Name "Install Spark" `
        -ClusterRoleCollection HeadNode `
        -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1

C#:

    // Add the Script Action to install Spark
    clusterInfo.ConfigActions.Add(new ScriptAction(
        "Install Spark",                                    // name of the config action
        new ClusterNodeType[] { ClusterNodeType.HeadNode }, // nodes to install Spark on
        new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // location of the install script
        null                                                // the script takes no parameters
    ));

You can then RDP into the head node and use spark-shell, or spark-submit, to run jobs. I am not sure how you would run a Spark job without RDPing into the head node, but that is another question.
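For example, a submission from the head node might look like the following. The install path, Spark version, and jar name below are assumptions, not documented values; adjust them to wherever the Script Action actually placed Spark on your cluster.

```shell
REM Run from a command prompt on the head node after RDPing in.
REM Paths and version numbers are assumptions -- check your install.
cd C:\apps\dist\spark-1.0.2

REM Submit the bundled SparkPi example to YARN in cluster mode.
bin\spark-submit --class org.apache.spark.examples.SparkPi ^
    --master yarn-cluster ^
    --num-executors 2 ^
    lib\spark-examples-1.0.2-hadoop2.4.0.jar 100
```

The same `spark-submit` invocation works for your own application jar; only the `--class` and jar arguments change.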


I also asked the same question of the Azure team. The following is the solution they gave:

"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li

Currently, this functionality is not supported. One workaround is to build a job-submission web service yourself:

  1. Create a Scala web service that uses the Spark APIs to start jobs on the cluster.
  2. Host this web service in a VM inside the same VNet as the cluster.
  3. Expose the web service endpoint externally through some authentication scheme. You could also employ an intermediate MapReduce job, though it would take longer.
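As a sketch of what step 3 might look like from the client side: the host name, route, auth scheme, and payload shape below are entirely hypothetical (nothing like this ships with HDInsight; it is the service you would have built in steps 1 and 2).

```shell
# Hypothetical client call to your own job-submission service.
# Endpoint, token handling, and JSON fields are all assumptions.
curl -X POST "https://my-spark-gateway.cloudapp.net/api/jobs" \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"mainClass": "com.example.MyJob", "jar": "wasb:///jars/myjob.jar"}'
```

The service behind this endpoint would translate the request into a Spark job launch inside the VNet, so the caller never needs RDP access to the cluster.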


You might consider using Brisk (https://brisk.elastatools.com), which offers Spark on Azure as a provisioned service (with support available). There's a free tier, and it lets you access blob storage with a wasb://path/to/files URI just like HDInsight.
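For instance, a wasb:// path can be passed anywhere Spark expects a file path. The class name, jar, container, and account names below are placeholders, not anything provided by Brisk or HDInsight:

```shell
# Placeholders throughout -- substitute your own class, jar,
# container, and storage account names.
spark-submit --class com.example.WordCount \
    myapp.jar \
    "wasb://mycontainer@myaccount.blob.core.windows.net/input/" \
    "wasb://mycontainer@myaccount.blob.core.windows.net/output/"
```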

It doesn't sit on YARN; instead, it is a lightweight, Azure-oriented distribution of Spark.

Disclaimer: I work on the project!

Best wishes,

Andy