Mapper input Key-Value pair in Hadoop

hadoop mapreduce key-value

The input to the mapper depends on which InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>.

If you do not change the InputFormat, using a Mapper with a different Key-Value type signature than <LongWritable, Text> will cause this error, because the framework hands the mapper a LongWritable key that cannot be cast to the key type you declared. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in Job setup:

job.setInputFormatClass(MyInputFormat.class);

And like I said, by default this is set to TextInputFormat.

Now, let's say your input data is a bunch of newline-separated records delimited by a comma:

"A,value1"
"B,value2"

If you want the input key-value pairs to the mapper to be ("A", "value1"), ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here and probably a few examples floating around StackOverflow as well.

In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method (it is called getRecordReader in the old org.apache.hadoop.mapred API), and have it return an instance of your custom RecordReader.

Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader in your custom RecordReader, and delegate all basic responsibilities to this instance. In the getCurrentKey and getCurrentValue methods you will implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting the result on the comma.
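To make the splitting step concrete, here is a minimal sketch of just the parsing logic, written without any Hadoop dependencies so it can be run on its own. The class name CommaRecordParser and the choice to treat a line without a comma as a key with an empty value are my own assumptions, not anything Hadoop prescribes. In a real RecordReader you would wrap the two strings in Text objects and feed it the line returned by LineRecordReader#getCurrentValue.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: the parsing a custom RecordReader could do per line.
final class CommaRecordParser {

    /**
     * Splits a line at the first comma. Everything before it is the key,
     * everything after it is the value. A line without a comma becomes
     * the key with an empty value (an assumption made for this sketch).
     */
    static Map.Entry<String, String> parse(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, comma), line.substring(comma + 1));
    }

    public static void main(String[] args) {
        for (String line : new String[] {"A,value1", "B,value2"}) {
            Map.Entry<String, String> kv = parse(line);
            // prints "A -> value1" and then "B -> value2"
            System.out.println(kv.getKey() + " -> " + kv.getValue());
        }
    }
}
```

Splitting only at the first comma means values may themselves contain commas. It is also usually simpler to do the split once per record (for example when advancing to the next line) and cache the key and value, rather than re-splitting in both getters.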

Finally, set your new InputFormat as the Job's InputFormat, as shown after the second paragraph above.

hadoop mapreduce key-value

In the book "Hadoop: The Definitive Guide" by Tom White, I think he has an appropriate answer to this (pg. 197):

"TextInputFormat’s keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop’s default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default."
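For the comma-separated data in the question, a job setup along these lines might be enough. Treat it as a sketch under assumptions: it needs the Hadoop client libraries on the classpath, the class name KeyValueJobSetup and the job name are made up, and the property name shown is the one used by the newer org.apache.hadoop.mapreduce API as far as I know (the name quoted from the book belongs to older releases, so check which one your version reads).

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Hypothetical driver fragment: only the input side of the job is configured here.
public class KeyValueJobSetup {

    public static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // Assumed property name for the new API; older releases used
        // key.value.separator.in.input.line instead.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "comma-separated input");
        // The mapper now receives <Text, Text> instead of <LongWritable, Text>.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        return job;
    }
}
```

The mapper for such a job would then be declared with Text as both input key and input value type.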

hadoop mapreduce key-value

With the default TextInputFormat, the key for the mapper input is always a LongWritable, not an Integer. The mapper input key is the byte offset of the line within the file, and the value is the whole line. The record reader reads a single line per cycle. The output of the mapper can be whatever you want: (Text, Text), (Text, IntWritable), and so on.
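To show what those offset keys look like, here is a small Hadoop-free sketch that computes the (offset, line) pairs for a string. It only imitates the shape of the records, assuming UTF-8 text and "\n" line endings. The class name LineOffsets is made up, and the real record reader also deals with split boundaries and other line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the (byte offset, line) records a mapper sees by default.
final class LineOffsets {

    /** Maps the byte offset of each line start to that line's text. */
    static Map<Long, String> offsets(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // prints "0 -> A,value1" and then "9 -> B,value2"
        offsets("A,value1\nB,value2\n")
                .forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```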
