Running the Hadoop WordCount example with Groovy
I was able to run this Groovy file with Hadoop 2.7.1. The procedure I followed is:
- Install Gradle
- Generate the jar file using Gradle. I asked this question, which helped me build the dependencies in Gradle
- Run with Hadoop as usual, just as we run a Java jar file, using this command from the folder where the jar is located:

hadoop jar buildSrc-1.0.jar in1 out4

where in1 is the input file and out4 is the output folder in HDFS.
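For reference, a minimal build.gradle for this kind of job could look like the sketch below. This is an assumption on my part, not the exact build file I used: the Groovy and Hadoop version numbers and the jar-bundling approach should be adapted to your setup.

```groovy
// build.gradle — a minimal sketch (versions are assumptions, adjust as needed)
apply plugin: 'groovy'

version = '1.0'

repositories {
    mavenCentral()
}

dependencies {
    // Groovy runtime used to compile the job
    compile 'org.codehaus.groovy:groovy-all:2.4.5'
    // Hadoop client APIs, matching the cluster version
    compile 'org.apache.hadoop:hadoop-client:2.7.1'
}

// bundle the Groovy runtime classes into the job jar so that
// "hadoop jar" can resolve them on the cluster
jar {
    from {
        configurations.compile
            .findAll { it.name.startsWith('groovy-all') }
            .collect { zipTree(it) }
    }
}
```

With a project named buildSrc and version 1.0, `gradle jar` would produce a jar named like the buildSrc-1.0.jar used in the command above.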
EDIT: As the above link is broken, I am pasting the Groovy file here.
import StartsWithCountMapper
import StartsWithCountReducer
import org.apache.hadoop.conf.Configured
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.util.Tool
import org.apache.hadoop.util.ToolRunner

class CountGroovyJob extends Configured implements Tool {
    @Override
    int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "StartsWithCount")
        job.setJarByClass(getClass())

        // configure output and input source
        TextInputFormat.addInputPath(job, new Path(args[0]))
        job.setInputFormatClass(TextInputFormat)

        // configure mapper and reducer
        job.setMapperClass(StartsWithCountMapper)
        job.setCombinerClass(StartsWithCountReducer)
        job.setReducerClass(StartsWithCountReducer)

        // configure output
        TextOutputFormat.setOutputPath(job, new Path(args[1]))
        job.setOutputFormatClass(TextOutputFormat)
        job.setOutputKeyClass(Text)
        job.setOutputValueClass(IntWritable)

        return job.waitForCompletion(true) ? 0 : 1
    }

    static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CountGroovyJob(), args))
    }

    class GroovyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable countOne = new IntWritable(1)
        private final Text reusableText = new Text()

        @Override
        protected void map(LongWritable key, Text value, Mapper.Context context) {
            value.toString().tokenize().each {
                reusableText.set(it)
                context.write(reusableText, countOne)
            }
        }
    }

    class GroovyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outValue = new IntWritable()

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer.Context context) {
            outValue.set(values.collect({ it.value }).sum())
            context.write(key, outValue)
        }
    }
}
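To see what the mapper and reducer compute without a cluster, here is the same word-count logic stripped of the Hadoop types, as a plain self-contained Java sketch (the class and method names are my own, not part of the job above): the mapper's tokenize-and-emit step and the reducer's per-key sum collapse into one counting loop.

```java
import java.util.*;

public class WordCountDemo {
    // Mimics GroovyMapper (split each line into tokens, emit (token, 1))
    // followed by GroovyReducer (sum the emitted counts per token).
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```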
The library you are using, groovy-hadoop, says it supports Hadoop 0.20.2. That is really old. But the CountGroovyJob.groovy code you are trying to run looks like it is meant for versions 2.x.x of Hadoop: in its imports you can see packages such as org.apache.hadoop.mapreduce.Mapper, whereas before version 2 the equivalent class lived in org.apache.hadoop.mapred.Mapper.
The most-voted answer in the SO question you linked is probably the answer you need. You have an incompatibility problem: the groovy-hadoop library can't work with your Hadoop 2.7.1.
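To make the API split concrete, here is a rough, non-runnable sketch of the two Mapper shapes (signatures simplified from memory; check the Hadoop javadocs before relying on them):

```java
// old "mapred" API (pre-2.x): Mapper is an interface,
// output goes through an OutputCollector
// package org.apache.hadoop.mapred
interface Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value,
             OutputCollector<K2, V2> output, Reporter reporter);
}

// new "mapreduce" API (used by CountGroovyJob): Mapper is a class,
// output goes through a Context object
// package org.apache.hadoop.mapreduce
class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void map(KEYIN key, VALUEIN value, Context context) { /* ... */ }
}
```

A library compiled against the first shape cannot drive a job written against the second, which is why the version mismatch matters here.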