
Hadoop MapReduce multiple input files


Based on the stacktrace, your output directory is not empty. So the simplest thing is actually to delete it before running the job:

bin/hadoop fs -rmr /user/cloudera/capital/output
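(On recent Hadoop releases -rmr is deprecated; hadoop fs -rm -r /user/cloudera/capital/output does the same thing.)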

Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).

Basically you need to shift all your indices one to the right:

Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);

MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);

Don't forget to clear your output folder though!
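For reference, a complete driver with the shifted indices could look like the sketch below. The mapper bodies are placeholders, and the job name, output key/value types, and the Job.getInstance call are assumptions on my part, so adjust them to match your actual code:

package org.myorg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Capital {

    // Placeholder mappers standing in for the ones from your code
    public static class JoinCityMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // city-specific parsing goes here
        }
    }

    public static class JoinCountryMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // country-specific parsing goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "capital");
        job.setJarByClass(Capital.class);

        // args[0] holds the class name here, so the paths start at index 1
        Path cityInputPath = new Path(args[1]);
        Path countryInputPath = new Path(args[2]);
        Path outputPath = new Path(args[3]);

        // each input path gets its own mapper; no setMapperClass call needed
        MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
        MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}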

Also, a small tip for you: you can separate the files with a comma (",") and set them with a single call like this:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat

And in your java code:

FileInputFormat.addInputPaths(job, args[1]);
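One caveat with this shortcut: FileInputFormat.addInputPaths feeds every file through the job's single mapper class, so unlike MultipleInputs you can no longer give City.dat and Country.dat different mappers.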


What is happening here is that the class name is deemed to be the first argument!

By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.
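If you want to verify this on your cluster, a throwaway class like the hypothetical ArgsProbe below just prints every argument with its index; running it via hadoop jar shows you exactly where the class name lands:

public class ArgsProbe {
    public static void main(String[] args) {
        // print each argument with its index to see what main() actually receives
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
    }
}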

So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:

Manifest-Version: 1.0
Main-Class: org.myorg.Capital
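One gotcha: make sure the manifest file ends with a newline, otherwise the last line can be silently ignored. Assuming your compiled classes live under classes/, you can then package the jar with that manifest like this:

jar cfm capital.jar MANIFEST.MF -C classes .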

And now your command would look like:

hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output

You can certainly just change the index values used in your code, but that's not an advisable solution.


Can you try this:

hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output

This should read all files in the single input directory.
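If you go this route, the driver needs just one input call; assuming the paths end up at indices 0 and 1 (that depends on how the class name is passed, as discussed above), it would be:

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Note that by default FileInputFormat skips hidden files (names starting with . or _) and reads every other file in the directory, all through the same mapper class.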