
Hadoop MapReduce RecordReader Implementation Necessary?


Input splits are logical references to data. If you look at the API, you will see that an InputSplit doesn't know anything about record boundaries. A mapper is launched for every input split, and the mapper's map() is run for every record (in a WordCount program, every line in the file).
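To make that concrete, here is a rough sketch (the class name LineCountingMapper is made up for illustration) of the per-record loop the framework runs for each split; run() is shown only to make that loop visible, normally you just override map():

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineCountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // run() is normally inherited; overridden here only to show the
        // per-record loop the framework executes for every input split.
        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            while (context.nextKeyValue()) {      // one iteration per record in this split
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // With the default TextInputFormat/LineRecordReader, 'line' is one line of text.
            context.write(new Text("lines"), new IntWritable(1));
        }
    }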

But how does a mapper know where the record boundaries are?

This is where your quote from the Hadoop MapReduce InputFormat interface comes in:

the application has to also implement a RecordReader on whom lies the responsibility to respect record-boundaries and present a record-oriented view of the logical InputSplit to the individual task

Every mapper is associated with an InputFormat. That InputFormat has the information on which RecordReader to use. Look at the API and you will find that it knows about the input splits and which RecordReader to use. If you want to know more about input splits and RecordReaders, you should read this answer.

A RecordReader defines where the record boundaries are; the InputFormat defines which RecordReader is used.
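As an illustration (a minimal sketch using the new mapreduce API, not the actual Hadoop source; the class name MyLineInputFormat is made up), a custom InputFormat names its RecordReader in createRecordReader(). This is essentially what TextInputFormat does, minus details such as compression handling:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // FileInputFormat computes the (size-based) splits; createRecordReader()
    // names the RecordReader that turns each split into (key, value) records.
    public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new LineRecordReader();   // the same reader TextInputFormat returns
        }
    }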

The WordCount program does not specify any InputFormat, so it defaults to TextInputFormat, which uses LineRecordReader and gives out every line as a separate record. And this is your source code.
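For example, a driver could state that default explicitly. This is only a sketch (the class name WordCountDriver is made up, and the mapper/reducer setup is elided), but it shows where the InputFormat, and therefore the RecordReader, gets picked:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // Stating explicitly what WordCount otherwise gets by default:
            // TextInputFormat, whose LineRecordReader turns each line into a record.
            job.setInputFormatClass(TextInputFormat.class);

            // ... mapper, reducer and output key/value classes as in the stock WordCount ...

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }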


[L]ogical splits based on input-size is insufficient for many applications since record boundaries are to be respected.

What this means is that, for an example file such as

    a b c d e
    f g h i j
    k l m n o

and we want every line to be a record. When the logical splits are based on input size, it is possible that there are two splits such as:

    a b c d e
    f g

and

    h i j
    k l m n o

If it weren't for the RecordReader, f g and h i j would be considered separate records; clearly, this isn't what most applications want.
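To see how both splits can still produce whole lines, here is a toy, plain-Java sketch (not Hadoop's actual LineRecordReader, just the same rule it applies): a split that does not start at byte 0 skips its partial first line, and every split reads past its end to finish its last line. The class name SplitBoundaryDemo is made up.

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class SplitBoundaryDemo {

        static List<String> recordsForSplit(byte[] data, int start, int end) {
            List<String> records = new ArrayList<>();
            int pos = start;
            // Skip the partial first line unless this split starts at the beginning.
            if (start != 0) {
                while (pos < data.length && data[pos] != '\n') pos++;
                pos++; // move past the newline
            }
            // Emit whole lines; the last record may extend beyond 'end'.
            while (pos < data.length && pos < end) {
                int lineStart = pos;
                while (pos < data.length && data[pos] != '\n') pos++;
                records.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
                pos++; // skip the newline
            }
            return records;
        }

        public static void main(String[] args) {
            byte[] file = "a b c d e\nf g h i j\nk l m n o\n".getBytes(StandardCharsets.UTF_8);
            int mid = 12; // an arbitrary size-based split point inside the second line
            System.out.println("split 1: " + recordsForSplit(file, 0, mid));
            System.out.println("split 2: " + recordsForSplit(file, mid, file.length));
        }
    }

Running this prints split 1: [a b c d e, f g h i j] and split 2: [k l m n o], so the two splits together yield exactly the three original lines, with no partial records.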

Answering your question: in the WordCount program it does not really matter where the record boundaries are, but there is a possibility that the same word is split across different logical splits. Therefore, logical splits based on size are not sufficient for the WordCount program.

Every MapReduce program 'respects' record boundaries. Otherwise, it would not be of much use.


You cannot see the RecordReader implementation in the WordCount example because it uses the default RecordReader and the default InputSplit specified by the framework.

If you want to see their implementation, you can find it in the Hadoop source code.

For more info on RecordReaders and how they work, please refer to: https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/