Hadoop MapReduce multiple input files
Based on the stack trace, your output directory is not empty. So the simplest thing is actually to delete it before running the job:
bin/hadoop fs -rmr /user/cloudera/capital/output
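If you would rather handle this from the driver, a minimal sketch using the FileSystem API (assuming conf and outputPath are your job's Configuration and output Path, with org.apache.hadoop.fs.FileSystem imported):

// Remove a stale output directory so the job doesn't abort with
// "output directory already exists".
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // true = delete recursively
}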
Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).
Basically you need to shift all your indices one to the right:
Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);

MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
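For context, a complete driver with the shifted indices might look like the sketch below (the package, job name, and output types are assumptions; the mapper classes follow your code):

package org.myorg;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Capital {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "capital join");
        job.setJarByClass(Capital.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] is the class name, so the real arguments start at index 1
        Path cityInputPath = new Path(args[1]);
        Path countryInputPath = new Path(args[2]);
        Path outputPath = new Path(args[3]);

        MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
        MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}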
Don't forget to clear your output folder though!
Also, a small tip for you: you can separate the files with a comma (",") and set them with a single call like this:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat
And in your Java code:
FileInputFormat.addInputPaths(job, args[1]);
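Note that addInputPaths (the plural variant) splits its argument on commas, and every file then goes through the single mapper set on the job; unlike MultipleInputs, there is no per-path mapper binding. A minimal sketch, assuming a hypothetical JoinMapper that can handle both record formats:

// addInputPaths splits "City.dat,Country.dat" on the comma and
// registers both files; they are all read by the same mapper.
FileInputFormat.addInputPaths(job, args[1]);
job.setMapperClass(JoinMapper.class); // hypothetical combined mapper
FileOutputFormat.setOutputPath(job, new Path(args[2]));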
What is happening here is that the class name is deemed to be the first argument!
By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.
So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:
Manifest-Version: 1.0
Main-Class: org.myorg.Capital
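To build the jar with that manifest you can pass it to the jar tool, for example (the classes/ directory is an assumption about your build layout):

jar cfm capital.jar MANIFEST.MF -C classes/ .

Make sure MANIFEST.MF ends with a newline; the jar tool ignores a last line that doesn't.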
And now your command would look like:
hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
You could certainly just change the index values used in your code, but that's not an advisable solution.
Can you try this:
hadoop jar capital.jar /user/cloudera/capital/input /user/cloudera/capital/output
This should read all files in the single input directory.
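In the driver this is a single addInputPath call on the directory; a minimal sketch, assuming the class name is no longer on the command line (e.g. via the manifest approach above) so the real arguments start at index 0:

// Pointing FileInputFormat at a directory makes Hadoop enumerate the
// files inside it (City.dat and Country.dat) and feed them all to the
// same mapper.
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));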