
running hadoop wordCount example with groovy


I was able to run this Groovy file with Hadoop 2.7.1. The procedure I followed was:

  1. Install Gradle.
  2. Generate the jar file using Gradle. I asked this question, which helped me build the dependencies in Gradle.
  3. Run the jar with Hadoop, just as you would run a Java jar, using this command from the folder where the jar is located:

    hadoop jar buildSrc-1.0.jar in1 out4

where in1 is the input file and out4 is the output folder, both in HDFS.
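For step 2, a minimal build.gradle for such a job might look like the sketch below. The dependency versions and the use of the older `compile` configuration are assumptions based on a Hadoop 2.7.1 / 2015-era Gradle setup; adapt them to your environment:

```groovy
// Sketch of a build.gradle for compiling a Groovy MapReduce job (versions are assumptions)
apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    // Groovy runtime: must be bundled in the jar or already on the cluster classpath
    compile 'org.codehaus.groovy:groovy-all:2.4.5'
    // Hadoop client libraries: needed to compile; the cluster provides them at runtime
    compile 'org.apache.hadoop:hadoop-client:2.7.1'
}
```

With this, `gradle jar` produces the jar that the `hadoop jar` command above runs.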

EDIT: As the above link is broken, I am pasting the Groovy file here.

    import org.apache.hadoop.conf.Configured
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.IntWritable
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.Mapper
    import org.apache.hadoop.mapreduce.Reducer
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.hadoop.util.Tool
    import org.apache.hadoop.util.ToolRunner

    class CountGroovyJob extends Configured implements Tool {
        @Override
        int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "StartsWithCount")
            job.setJarByClass(getClass())

            // configure input source
            TextInputFormat.addInputPath(job, new Path(args[0]))
            job.setInputFormatClass(TextInputFormat)

            // configure mapper and reducer (the classes defined below in this file)
            job.setMapperClass(GroovyMapper)
            job.setCombinerClass(GroovyReducer)
            job.setReducerClass(GroovyReducer)

            // configure output
            TextOutputFormat.setOutputPath(job, new Path(args[1]))
            job.setOutputFormatClass(TextOutputFormat)
            job.setOutputKeyClass(Text)
            job.setOutputValueClass(IntWritable)

            return job.waitForCompletion(true) ? 0 : 1
        }

        static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new CountGroovyJob(), args))
        }

        // nested classes must be static so Hadoop can instantiate them reflectively
        static class GroovyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable countOne = new IntWritable(1)
            private final Text reusableText = new Text()

            @Override
            protected void map(LongWritable key, Text value, Mapper.Context context) {
                value.toString().tokenize().each {
                    reusableText.set(it)
                    context.write(reusableText, countOne)
                }
            }
        }

        static class GroovyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable outValue = new IntWritable()

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Reducer.Context context) {
                // IntWritable exposes its int via get()
                outValue.set(values.collect { it.get() }.sum())
                context.write(key, outValue)
            }
        }
    }
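Stripped of the Hadoop types, the core of what the mapper and reducer above do (tokenize each line, then sum the counts per word) can be sketched in plain Java. This is a simplified illustration, not the MapReduce code itself; `WordCountSketch` and `wordCount` are names invented here:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Mimics the mapper's tokenize step plus the reducer's per-key sum, in memory
    static Map<String, Integer> wordCount(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;           // skip empty tokens from leading whitespace
            counts.merge(token, 1, Integer::sum);    // emit (token, 1), summed per key
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount("to be or not to be");
        System.out.println(counts.get("to"));  // 2
        System.out.println(counts.get("be"));  // 2
        System.out.println(counts.get("or"));  // 1
    }
}
```

In the real job, the per-key grouping that `HashMap` does here is what Hadoop's shuffle phase performs between the mapper and the reducer.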


The library you are using, groovy-hadoop, says it supports Hadoop 0.20.2. It's really old.

But the CountGroovyJob.groovy code you are trying to run looks like it is meant for Hadoop 2.x. You can see this in the imports: it uses packages such as org.apache.hadoop.mapreduce.Mapper, whereas before version 2 the class was org.apache.hadoop.mapred.Mapper.

The most upvoted answer in the SO question you linked is probably the answer you need: you have an incompatibility problem. The groovy-hadoop library cannot work with your Hadoop 2.7.1.