
Hadoop MapReduce read the data set once for multiple jobs


It might be possible by using a custom partitioner. The custom partitioner would redirect the output of the mapper to the appropriate reducer based on the key. So the keys of the mapper output would be R1*, R2*, R3*. The pros and cons of this approach still need to be looked into.
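A minimal sketch of what such a partitioner could look like, assuming mapper output keys are plain `Text` prefixed with R1/R2/R3 (the class name and prefixes are illustrative, not from the question):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: routes each mapper output record to a reducer
// based on the R1/R2/R3 prefix of its key.
public class AnalyticsPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        if (k.startsWith("R1")) {
            return 0 % numPartitions;
        } else if (k.startsWith("R2")) {
            return 1 % numPartitions;
        } else {
            return 2 % numPartitions;
        }
    }
}
```

You would then register it on the job, e.g. `job.setPartitionerClass(AnalyticsPartitioner.class);` together with `job.setNumReduceTasks(3);`, so that each analytics process gets its own reducer.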

As mentioned, Tez is one of the alternatives, but it is still in the incubator phase.


You can:

  1. Have your reducer(s) do all the Analytics (1-3) in the same pass/job. EDIT: From your comment I see that this alternative is not useful for you, but I am leaving it here for future reference, since in some cases it is possible to do this.
  2. Use a more generalized model than MapReduce. For example, Apache Tez (still an incubator project) can be used for your use case.


EDIT: Added the following regarding Alternative 1:

You could also make the mapper generate a key indicating which analytics process the output is intended for. Hadoop will automatically group records by this key and send them all to the same reducer. The value generated by the mapper would be a tuple <k,v>, where k is the original key you intended to generate, so the mapper emits <k_analytics, <k,v>> records. The reducer reads the analytics key and, depending on its value, calls the appropriate analytics method (within your reducer class). This approach works, but only if your reducers do not have to deal with huge amounts of data, since you will likely need to keep the <k,v> tuples in memory (in a list or a hashtable) while you run the analytics process, as they will not be sorted by their original key. If this is not something your reducer can handle, then the custom partitioner suggested by @praveen-sripati may be an option to explore. A sketch of this mapper/reducer layout is shown below.
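A rough sketch of that layout, assuming text input and three analytics processes tagged A1/A2/A3 (the class names, tags, and key-extraction helper are all illustrative placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: tags each record with the analytics process it is meant
// for and packs the original key into the value alongside the record.
public class MultiAnalyticsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        // The data set is read once; each record is emitted once per analytics
        // process, keyed by the analytics id.
        for (String analyticsId : new String[] {"A1", "A2", "A3"}) {
            String originalKey = extractKey(record); // placeholder key extraction
            context.write(new Text(analyticsId), new Text(originalKey + "\t" + record));
        }
    }

    private String extractKey(String record) {
        return record.split("\t")[0]; // illustrative: first field is the key
    }
}

// Hypothetical reducer: all values for one analytics id arrive together, but
// they are NOT sorted by the embedded original key, so the analytics methods
// may need to buffer them in memory.
public class MultiAnalyticsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text analyticsId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        switch (analyticsId.toString()) {
            case "A1": runAnalytics1(values, context); break;
            case "A2": runAnalytics2(values, context); break;
            default:   runAnalytics3(values, context); break;
        }
    }

    private void runAnalytics1(Iterable<Text> values, Context context) { /* ... */ }
    private void runAnalytics2(Iterable<Text> values, Context context) { /* ... */ }
    private void runAnalytics3(Iterable<Text> values, Context context) { /* ... */ }
}
```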

EDIT: As suggested by @judge-mental, alternative 1 can be further improved by having the mappers issue <<k_analytics, k>, value>; in other words, make the original key part of the composite key (alongside the analytics type) instead of part of the value. A reducer then receives all the keys for one analytics job grouped together and can perform streaming operations on the values without having to keep them in RAM. A sketch of such a composite key is shown below.
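One way to realize the <<k_analytics, k>, value> idea is a custom `WritableComparable` composite key plus a partitioner that hashes only the analytics part, so a whole analytics job stays on one reducer while its records arrive sorted by the original key. The class names are assumptions for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical composite key: analytics id plus the original key. Hadoop sorts
// records by (analyticsId, originalKey), so a reducer can stream through one
// analytics job's keys without buffering the values in RAM.
public class AnalyticsKey implements WritableComparable<AnalyticsKey> {
    private final Text analyticsId = new Text();
    private final Text originalKey = new Text();

    public void set(String id, String key) {
        analyticsId.set(id);
        originalKey.set(key);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        analyticsId.write(out);
        originalKey.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        analyticsId.readFields(in);
        originalKey.readFields(in);
    }

    @Override
    public int compareTo(AnalyticsKey other) {
        int cmp = analyticsId.compareTo(other.analyticsId);
        return cmp != 0 ? cmp : originalKey.compareTo(other.originalKey);
    }

    public Text getAnalyticsId() { return analyticsId; }
}

// Partition on the analytics id only, so every record for one analytics job
// lands on the same reducer regardless of its original key.
class AnalyticsIdPartitioner extends Partitioner<AnalyticsKey, Text> {
    @Override
    public int getPartition(AnalyticsKey key, Text value, int numPartitions) {
        return (key.getAnalyticsId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Depending on how you want the reduce calls grouped, you would typically also set a grouping comparator (on the analytics id alone for one reduce() call per analytics job, or on the full composite key for one call per original key).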