Spark on yarn concept understanding

Adding to other answers.

Is it necessary that Spark is installed on all the nodes in the YARN cluster?

No, not if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on the cluster nodes only in standalone mode.

These are the visualizations of spark app deployment modes.

[Figure: Spark standalone cluster]

In cluster mode, the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.

[Figure: YARN cluster mode]

[Figure: YARN client mode]

This table offers a concise list of differences between these modes:

[Table: differences among Standalone, YARN Cluster and YARN Client modes]

pics source

It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?

A Hadoop installation is not mandatory, but the configuration files (not all of them) are. We can call such client machines gateway nodes. The configuration is needed for two main reasons.

The configuration contained in the HADOOP_CONF_DIR directory is distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode, the ResourceManager's address is picked up from the Hadoop configuration (typically yarn-site.xml, which overrides the defaults in yarn-default.xml). That is why the --master parameter is simply yarn.
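As a rough sketch of what this means on a gateway node, the function below checks that a configuration directory contains the client-side files before a job is submitted. The function name `check_hadoop_conf`, the choice of `core-site.xml` and `yarn-site.xml` as the files to look for, and the path in the usage comment are assumptions for illustration, not something Spark mandates.

```shell
# Preflight check for a gateway node (illustrative sketch).
# Assumption: core-site.xml and yarn-site.xml are the client-side files
# we expect to find in the configuration directory.
check_hadoop_conf() {
  conf_dir="$1"
  for conf_file in core-site.xml yarn-site.xml; do
    if [ ! -f "$conf_dir/$conf_file" ]; then
      echo "missing $conf_dir/$conf_file" >&2
      return 1
    fi
  done
  return 0
}

# Usage (hypothetical path; spark-submit is not run by this snippet):
#   export HADOOP_CONF_DIR=/etc/hadoop/conf
#   check_hadoop_conf "$HADOOP_CONF_DIR" && \
#     ./bin/spark-submit --master yarn --deploy-mode cluster \
#       --class org.apache.spark.examples.SparkPi <examples-jar> 100
```

Note that no ResourceManager host appears on the command line: with `--master yarn`, the address comes from the files in `HADOOP_CONF_DIR`.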

Update: (2017-01-04)

Spark 2.0+ no longer requires a fat assembly jar for production deployment. source

hadoop apache-spark hdfs hadoop-yarn

We are running Spark jobs on YARN (we use HDP 2.2).

We don't have Spark installed on the cluster. We only added the Spark assembly jar to HDFS.

For example, to run the Pi example:

    ./bin/spark-submit \
      --verbose \
      --class org.apache.spark.examples.SparkPi \
      --master yarn-cluster \
      --conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
      --num-executors 2 \
      --driver-memory 512m \
      --executor-memory 512m \
      --executor-cores 4 \
      hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100

--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar - This setting tells YARN where to fetch the Spark assembly from. If you don't set it, the jar is uploaded from the machine where you run spark-submit.

About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the configuration directory from your cluster to your client.
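Copying the configuration could look roughly like the sketch below. The helper name `fetch_hadoop_conf`, the host `user@master`, and both paths are hypothetical; the directory that holds the client configuration depends on your distribution.

```shell
# Copy the cluster's Hadoop client configuration to this machine (sketch).
# A source containing ':' is treated as remote (user@host:/path) and copied
# with scp; anything else is treated as a local or mounted path.
# The destination directory should not exist yet; it becomes the copy.
fetch_hadoop_conf() {
  conf_src="$1"
  conf_dest="$2"
  case "$conf_src" in
    *:*) scp -r "$conf_src" "$conf_dest" ;;
    *)   cp -r "$conf_src" "$conf_dest" ;;
  esac
}

# Usage (hypothetical host and paths):
#   fetch_hadoop_conf user@master:/etc/hadoop/conf "$HOME/hadoop-conf"
#   export HADOOP_CONF_DIR="$HOME/hadoop-conf"
```

After the copy, spark-submit on the client only needs `HADOOP_CONF_DIR` to point at the new directory.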


1 - Spark follows a master/slave architecture, so on your cluster you have to install a Spark master and N Spark slaves. You can run Spark in standalone mode, but using the YARN architecture will give you some benefits. There is a very good explanation of it here: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/

2 - It is necessary if you want to use YARN or HDFS, for example, but as I said before you can run it in standalone mode.

