Understanding mapreduce.framework.name wrt Hadoop Understanding mapreduce.framework.name wrt Hadoop hadoop hadoop

Understanding mapreduce.framework.name wrt Hadoop


  • yarn stands for MR version 2.
  • classic is for MR version 1
  • local for local runs of the MR jobs.

MR V1 and MR V2 are just about how resources are managed and a job is executed. The current hadoop release is capable of both (and even in local lightweight mode). When you set the value as yarn, you are simply instructing the framework to use yarn way to execute the job. Similarly when you set it to local, you just telling the framework that there is no cluster for execution and its all within a JVM. It is not a different infrastructure for MR V1 and MR V2 framework; its just the way of job execution, which changes.

jobTracker, TaskTracker etc are all just daemon thread, which are spawned when needed and killed.

MRv1 uses the JobTracker to create and assign tasks to data nodes. This was found to be too inefficient when dealing with large cluster, leading to yarn

MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager for each cluster, and each data node runs a Node Manager. For each job, one slave node will act as the Application Master, monitoring resources/tasks, etc.

Local mode is given to simulate and debug MR application within a single machine/JVM.

EDIT: Based on comments

jps (Java Virtual Machine Process Status)is a JVM tool, which according to official page:

The jps tool lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system. The tool is limited to reporting information on JVMs for which it has the access permissions.

So,

  1. jps is not a big data tool, rather a java tool which tells about JVM, however it does not divulge any information on processes running within the JVM.

  2. It only list the JVM, it has access to. It means there still be certain JVMs which remains undetected.

Keeping the above points in mind, if you observed that jsp command emits different result based on hadoop deployment mode:

  1. Local (or Standalone) mode: There are no daemons and everything runs on a single JVM.
  2. Pseudo-Distributed mode: Each daemon(Namenode, Datanode etc) runs on its own JVM on a single host.
  3. Distributed mode: Each Daemon run on its own JVM across a cluster of hosts.

Hence each of the processes may or may not run in same JVM and hence jps output will be different.

Now in distributed mode, the MR v2 framework works in default mode. i.e. yarn; hence you see yarn specific daemons running

NamenodeDatanodeResourceManagerNodeManager

Apache Hadoop 1.x (MRv1) consists of the following daemons:

NamenodeDatanodeJobtrackerTasktracker

Note that NameNode and DataNode are common between two, because they are HDFS specific daemon, while other two are MR v1 and yarn specific.