What should be the size of the file in HDFS for best MapReduce job performance?


HDFS is designed to support very large files, not small files. Applications that are a good fit for HDFS deal with large data sets: they write their data once, read it one or more times, and require those reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When you place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). For example, if you have a 1 GB file and place it in HDFS, there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes. The goal of splitting a file is parallel processing and failover of the data; which DataNode each block/chunk resides on depends on your cluster configuration.
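As a rough illustration of the block layout described above, here is a minimal sketch (not from the original answer) that asks HDFS where the blocks of a file live, using Hadoop's FileSystem API. The path /data/input/large-file.txt is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input/large-file.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block of the file, with the DataNodes
        // holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("File size: " + status.getLen()
                + " bytes, blocks: " + blocks.length);
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```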

How mappers get assigned

The number of mappers is determined by the number of input splits of your data in the MapReduce job. With a typical InputFormat, it is directly proportional to the number of files and their sizes. Suppose your HDFS block size is configured to 64 MB (the default) and you have one 100 MB file: it will occupy 2 blocks, so there will be 2 splits, and 2 mappers will be assigned based on those blocks. But if you instead have two 30 MB files, each file will occupy its own block, and mappers will be assigned accordingly, so you again get 2 mappers.
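A minimal sketch of that split arithmetic, assuming the common case where the split size equals the 64 MB block size and a split never spans more than one file:

```java
public class SplitCountExample {
    // Round up: number of splits needed for one file of the given size.
    static long splitsFor(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long splitSize = 64L * 1024 * 1024; // default block/split size: 64 MB

        // One 100 MB file -> 2 splits -> 2 mappers.
        long oneLargeFile = splitsFor(100L * 1024 * 1024, splitSize);

        // Two 30 MB files -> 1 split each -> 2 mappers.
        long twoSmallFiles = 2 * splitsFor(30L * 1024 * 1024, splitSize);

        System.out.println("100 MB file: " + oneLargeFile + " mappers");
        System.out.println("Two 30 MB files: " + twoSmallFiles + " mappers");
    }
}
```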

So you don't need to split a large file yourself, but if you are dealing with very many small files, it is worth combining them.
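One common way to combine small files at job-submission time, assuming a Hadoop release whose new MapReduce API ships CombineTextInputFormat (Hadoop 2.x onwards), is to pack many small files into each split. This driver is a sketch, not part of the original answer, and the 128 MB cap is only illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine small files");
        job.setJarByClass(SmallFilesDriver.class);

        // Pack many small files into splits of up to ~128 MB each, so one
        // mapper processes several small files instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/reducer classes are omitted here (identity defaults apply);
        // the point is only how small files are grouped into fewer splits.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```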

This link will be helpful to understand the problem with small files.

Please refer to the link below for more detail about the HDFS design.

http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html