
Hadoop MapReduce read the data set once for multiple jobs


It might be possible by using a custom partitioner. The custom partitioner would redirect the output of the mapper to the appropriate reducer based on the key. So the keys of the mapper output would be R1*, R2*, R3*. The pros and cons of this approach still need to be looked into.
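A minimal sketch of what such a partitioner could look like, assuming mapper output keys are plain `Text` prefixed with R1/R2/R3 (the class name and prefixes are illustrative, not from the question):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: routes each mapper output record to a reducer
// based on the R1/R2/R3 prefix of its key.
public class AnalyticsPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        if (k.startsWith("R1")) {
            return 0 % numPartitions;
        } else if (k.startsWith("R2")) {
            return 1 % numPartitions;
        } else {
            return 2 % numPartitions;
        }
    }
}
```

You would then register it on the job, e.g. `job.setPartitionerClass(AnalyticsPartitioner.class);` together with `job.setNumReduceTasks(3);`, so that each analytics process gets its own reducer.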

As mentioned, Tez is one of the alternatives, but it is still in the incubator phase.


You can:

  1. Have your reducer(s) do all the Analytics (1-3) in the same pass/job. EDIT: From your comment I see that this alternative is not useful for you, but I am leaving it here for future reference, since in some cases it is possible to do this.
  2. Use a more generalized model than MapReduce. For example, Apache Tez (still an incubator project) can be used for your use case.


EDIT: Added the following regarding Alternative 1:

You could also make the mapper generate a key indicating which analytics process the output is intended for. Hadoop will automatically group records by this key and send them all to the same reducer. The value generated by the mapper would be a tuple <k,v>, where k is the original key you intended to generate, so the mapper emits <k_analytics, <k,v>> records. The reducer reads the analytics key and, depending on its value, calls the appropriate analytics method (within your reducer class). This approach works, but only if your reducers do not have to deal with huge amounts of data, since you will likely need to keep the <k,v> tuples in memory (in a list or a hashtable) while you run the analytics process, as they will not be sorted by their original key. If this is not something your reducer can handle, then the custom partitioner suggested by @praveen-sripati may be an option to explore. A sketch of this mapper/reducer layout is shown below.
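A rough sketch of that layout, assuming text input and three analytics processes tagged A1/A2/A3 (the class names, tags, and key-extraction helper are all illustrative placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: tags each record with the analytics process it is meant
// for and packs the original key into the value alongside the record.
public class MultiAnalyticsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        // The data set is read once; each record is emitted once per analytics
        // process, keyed by the analytics id.
        for (String analyticsId : new String[] {"A1", "A2", "A3"}) {
            String originalKey = extractKey(record); // placeholder key extraction
            context.write(new Text(analyticsId), new Text(originalKey + "\t" + record));
        }
    }

    private String extractKey(String record) {
        return record.split("\t")[0]; // illustrative: first field is the key
    }
}

// Hypothetical reducer: all values for one analytics id arrive together, but
// they are NOT sorted by the embedded original key, so the analytics methods
// may need to buffer them in memory.
public class MultiAnalyticsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text analyticsId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        switch (analyticsId.toString()) {
            case "A1": runAnalytics1(values, context); break;
            case "A2": runAnalytics2(values, context); break;
            default:   runAnalytics3(values, context); break;
        }
    }

    private void runAnalytics1(Iterable<Text> values, Context context) { /* ... */ }
    private void runAnalytics2(Iterable<Text> values, Context context) { /* ... */ }
    private void runAnalytics3(Iterable<Text> values, Context context) { /* ... */ }
}
```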

EDIT: As suggested by @judge-mental, alternative 1 can be further improved by having the mappers issue <<k_analytics, k>, value>; in other words, make the original key part of the composite key (alongside the analytics type) instead of part of the value. A reducer then receives all the keys for one analytics job grouped together and can perform streaming operations on the values without having to keep them in RAM. A sketch of such a composite key is shown below.
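One way to realize the <<k_analytics, k>, value> idea is a custom `WritableComparable` composite key plus a partitioner that hashes only the analytics part, so a whole analytics job stays on one reducer while its records arrive sorted by the original key. The class names are assumptions for illustration:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical composite key: analytics id plus the original key. Hadoop sorts
// records by (analyticsId, originalKey), so a reducer can stream through one
// analytics job's keys without buffering the values in RAM.
public class AnalyticsKey implements WritableComparable<AnalyticsKey> {
    private final Text analyticsId = new Text();
    private final Text originalKey = new Text();

    public void set(String id, String key) {
        analyticsId.set(id);
        originalKey.set(key);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        analyticsId.write(out);
        originalKey.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        analyticsId.readFields(in);
        originalKey.readFields(in);
    }

    @Override
    public int compareTo(AnalyticsKey other) {
        int cmp = analyticsId.compareTo(other.analyticsId);
        return cmp != 0 ? cmp : originalKey.compareTo(other.originalKey);
    }

    public Text getAnalyticsId() { return analyticsId; }
}

// Partition on the analytics id only, so every record for one analytics job
// lands on the same reducer regardless of its original key.
class AnalyticsIdPartitioner extends Partitioner<AnalyticsKey, Text> {
    @Override
    public int getPartition(AnalyticsKey key, Text value, int numPartitions) {
        return (key.getAnalyticsId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Depending on how you want the reduce calls grouped, you would typically also set a grouping comparator (on the analytics id alone for one reduce() call per analytics job, or on the full composite key for one call per original key).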