
How do WordCount MapReduce jobs run on a Hadoop YARN cluster with Apache Tez?


To answer your first question on converting MapReduce jobs to Tez DAGs:

Any MapReduce job can be thought of as a single DAG with two vertices (stages). The first vertex is the Map phase, and it is connected to the downstream Reduce vertex via a Shuffle edge.

There are two ways in which MR jobs can be run on Tez:

  1. One approach is to write a native two-stage DAG using the Tez APIs directly. This is what is currently present in tez-examples (see the first sketch after this list).
  2. The second is to use the MapReduce APIs themselves in yarn-tez mode. In this scenario, a layer intercepts the MR job submission and, instead of running it as MR, translates the job into a two-stage Tez DAG and executes that DAG on the Tez runtime (see the second sketch after this list).
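
To make the first approach concrete, here is a minimal sketch of a two-vertex word-count DAG, loosely following the WordCount example in tez-examples. `TokenProcessor` and `SumProcessor` stand in for your map-side and reduce-side logic (not shown); treat the exact builder calls as illustrative rather than definitive for any particular Tez release:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class WordCountDag {
  public static DAG createDag(TezConfiguration tezConf) {
    // Vertex 1: the "Map" stage; TokenProcessor holds the user's tokenizing logic.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create(TokenProcessor.class.getName()));

    // Vertex 2: the "Reduce" stage, here with a single task.
    Vertex summation = Vertex.create("Summation",
        ProcessorDescriptor.create(SumProcessor.class.getName()), 1);

    // The Shuffle edge: sorted, partitioned key/value movement between the stages.
    OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .setFromConfiguration(tezConf)
        .build();

    return DAG.create("WordCount")
        .addVertex(tokenizer)
        .addVertex(summation)
        .addEdge(Edge.create(tokenizer, summation,
            shuffle.createDefaultEdgeProperty()));
  }
}
```

For the second approach no Tez code is needed at all: an unmodified MR driver is submitted with `mapreduce.framework.name` set to `yarn-tez` (the Tez client libraries must be on the classpath). A minimal sketch, assuming the standard WordCount `TokenizerMapper`/`IntSumReducer` classes from the Hadoop tutorial:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountOnTez {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The interception point: "yarn-tez" instead of "yarn" makes the
    // submission layer translate this MR job into a two-stage Tez DAG.
    conf.set("mapreduce.framework.name", "yarn-tez");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountOnTez.class);
    job.setMapperClass(TokenizerMapper.class);  // the usual MR mapper (not shown)
    job.setReducerClass(IntSumReducer.class);   // the usual MR reducer (not shown)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```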

For your data-handling questions:

The user provides the logic for understanding the data to be read and how to split it. Tez then takes each split of data and takes over the responsibility of assigning a split, or a set of splits, to a given task (see the sketch below).
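
As a rough illustration of that boundary in the native API: the user names an InputFormat (the split logic), and Tez decides how the resulting splits are grouped and handed out to tasks. A sketch, assuming the `tokenizer` vertex from the DAG above; the builder methods come from MRInput's config builder, but treat the exact combination as illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.tez.dag.api.DataSourceDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;

public class SplitWiring {
  // Wires a user-chosen split source (TextInputFormat here) into the Map vertex.
  static Vertex addInput(Vertex tokenizer, TezConfiguration tezConf, String inputPath) {
    DataSourceDescriptor dataSource = MRInput
        .createConfigBuilder(new Configuration(tezConf), TextInputFormat.class, inputPath)
        .groupSplits(true)         // Tez may group small splits into task-sized units
        .generateSplitsInAM(true)  // compute splits in the ApplicationMaster
        .build();
    // From here on, split-to-task assignment is Tez's responsibility.
    return tokenizer.addDataSource("Input", dataSource);
  }
}
```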

The Tez framework then controls the generation and movement of data, i.e. where intermediate data is generated and how it moves between two vertices/stages. However, it does not control the underlying data contents/structure, partitioning, or serialization logic; those are provided by user plugins (see the sketch below).
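
Partitioning is a good example of such a plugin. Here is a hypothetical sketch of a custom partitioner that buckets words by their first letter; the `Partitioner` interface is the one from Tez's runtime library, and the class would be named in the edge config builder above in place of `HashPartitioner`:

```java
import org.apache.tez.runtime.library.api.Partitioner;

// Hypothetical user plugin: Tez moves the bytes between the two vertices,
// but this class decides which reduce partition each (word, count) pair
// lands in.
public class FirstLetterPartitioner implements Partitioner {
  @Override
  public int getPartition(Object key, Object value, int numPartitions) {
    String word = key.toString(); // key is a Text in the word-count DAG
    if (word.isEmpty()) {
      return 0;
    }
    int c = Character.toLowerCase(word.charAt(0));
    return (c % numPartitions + numPartitions) % numPartitions; // always non-negative
  }
}
```

Serialization works the same way: in the sketches above, the key/value class names passed to the edge config builder determine how the data is (de)serialized, and the framework itself never looks inside the records.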

The above is just a high-level view; there are additional intricacies. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html ).