What is meant by implementing an advanced job control framework to help chain multiple Map-Reduce jobs?

What is meant by implementing an advanced job control framework to help chain multiple Map-Reduce jobs?


It looks like the project you are referring to might be related to this Jira ticket.

Right now the JobControl class is pretty bare, and it's missing a number of features that could make a user's life easier. For example:

  • Ability to get notifications when a job changes state: right now you just call JobControl.run and that's it, but in practice it would be useful to be notified when something changes in a job.
  • Re-submit failed jobs: you could implement a facility to resubmit a job if it fails. For example, the ControlledJob class could take a maximum-retries parameter and retry up to that limit before sending a notification that the job failed.
  • A lot of jobs run on a regular basis (weekly, daily, hourly, ...). This is typically done via crontab, so it could be interesting to have this feature embedded in Hadoop. For example, users could set up a recurring job by specifying a period, and the JobControl would run it at those intervals.
  • Maybe a UI to visualize your jobflow and each job's dependencies, showing which steps have already completed and which haven't.
  • It could be interesting to launch not only Map/Reduce jobs but also Hive or Pig jobs, for example, so you could provide a generic interface for users to submit any kind of job and monitor them seamlessly.
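To make the first two ideas more concrete, here is a minimal sketch of a state-change listener combined with a retry limit. It is plain Java with no Hadoop dependency, and every name in it (JobState, JobStateListener, RetryingJob) is hypothetical: none of these types exist in Hadoop today. It only illustrates the shape such an extension could take, not how JobControl or ControlledJob actually work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// All types below are hypothetical illustrations, not part of Hadoop.
enum JobState { WAITING, RUNNING, SUCCESS, FAILED }

// Callback invoked whenever a job moves from one state to another.
interface JobStateListener {
    void onStateChange(String jobName, JobState oldState, JobState newState);
}

class RetryingJob {
    private final String name;
    private final Callable<Boolean> body; // returns true when the job succeeded
    private final int maxRetries;         // extra attempts allowed after the first failure
    private final List<JobStateListener> listeners = new ArrayList<>();
    private JobState state = JobState.WAITING;
    private int attempts = 0;

    RetryingJob(String name, Callable<Boolean> body, int maxRetries) {
        this.name = name;
        this.body = body;
        this.maxRetries = maxRetries;
    }

    void addListener(JobStateListener listener) {
        listeners.add(listener);
    }

    JobState getState() { return state; }

    int getAttempts() { return attempts; }

    // Updates the state and notifies listeners, but only on a real change,
    // so intermediate retries do not generate notifications.
    private void transition(JobState next) {
        if (next == state) {
            return;
        }
        JobState previous = state;
        state = next;
        for (JobStateListener listener : listeners) {
            listener.onStateChange(name, previous, next);
        }
    }

    // Runs the body once, then retries up to maxRetries more times on failure.
    // An exception thrown by the body counts as a failed attempt.
    JobState run() {
        for (int i = 0; i <= maxRetries; i++) {
            attempts++;
            transition(JobState.RUNNING);
            boolean succeeded;
            try {
                succeeded = body.call();
            } catch (Exception e) {
                succeeded = false;
            }
            if (succeeded) {
                transition(JobState.SUCCESS);
                return state;
            }
        }
        // Listeners hear about the failure only after all retries are used up.
        transition(JobState.FAILED);
        return state;
    }
}
```

In a real integration, the Callable would presumably build a fresh Job and return the result of job.waitForCompletion(true), since as far as I know a Job instance cannot be submitted a second time once it has run. The same listener hook could also drive a UI or a periodic scheduler, but that is beyond this sketch.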

In the end, I don't think you need to invent a completely new framework; the JobControl class already provides a good starting point. Try to think from the user's point of view: what can you do to make submitting and managing jobs easier and shorter? The ideas here and in the ticket are only examples, so feel free to come up with your own.

As far as Oozie is concerned, it gives you a higher-level abstraction for controlling a jobflow, but it's also more complex to set up and is better reserved for more complex jobs. I know for a fact that some people are hesitant to use Oozie because it adds overhead to their applications. The other big difference is that Oozie is a server, which is additional overhead, while JobControl just runs on the client machine. While some of the features mentioned above are present in Oozie in one way or another, the ability to keep things simple and run on the client machine without the extra setup Oozie requires is, in my opinion, the key to your project.