How to specifically determine input for each map step in MRJob?


You can use mrjob's runners.

Define each job separately, then invoke them from another Python script that passes the first job's output directory to the second job as input.

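from NumLines import NumLines
from WordsPerLine import WordsPerLine
import sys

# Output directory of the first job, used as input to the second job
intermediate = None


def firstJob(input_file):
    global intermediate
    mr_job = NumLines(args=[input_file])
    with mr_job.make_runner() as runner:
        runner.run()
        # Note: the runner may clean up temporary output when this block exits;
        # consider passing an explicit --output-dir if the data must persist.
        intermediate = runner.get_output_dir()


def secondJob(input_file):
    # The second job reads both the first job's output and the original input
    mr_job = WordsPerLine(args=[intermediate, input_file])
    with mr_job.make_runner() as runner:
        runner.run()


if __name__ == '__main__':
    firstJob(sys.argv[1])
    secondJob(sys.argv[1])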

The script can then be invoked with:

python main_script.py input.txt
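
Because the second job receives both the intermediate directory and the original input file, its mapper may need to tell which of the two a given line came from. Here is a minimal sketch, assuming the job runs under Hadoop Streaming, which exposes job configuration properties such as `mapreduce.map.input.file` to the task as environment variables with the dots replaced by underscores. The helper names `current_input_path` and `is_intermediate` are hypothetical and not part of mrjob, and the substring check is only a heuristic:

```python
import os


def current_input_path(environ=None):
    """Return the path of the file the current map task is reading, or None."""
    env = os.environ if environ is None else environ
    # Newer Hadoop versions set mapreduce_map_input_file;
    # older ones set map_input_file.
    return env.get("mapreduce_map_input_file") or env.get("map_input_file")


def is_intermediate(path, intermediate_dir):
    """Heuristic: True if path appears to lie inside intermediate_dir."""
    if path is None:
        return False
    # Substring match, so that URI prefixes such as "file:" do not matter.
    prefix = intermediate_dir.rstrip("/") + "/"
    return prefix in path
```

Inside a mapper, you could call `is_intermediate(current_input_path(), some_dir)` to branch on the source of each line, where `some_dir` is however you choose to hand the intermediate path to the job. mrjob also ships `jobconf_from_env` in `mrjob.compat` for reading such properties, which may be preferable if you need it to work across Hadoop versions.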