Hadoop gzip input file using only one mapper [duplicate]


Gzip files can't be split, so Hadoop hands the whole file to a single map task, no matter how large the file is. To have the data processed by multiple maps, you have to use a compression format whose files can be split, such as bzip2 or the splittable LZO setup described in this article. (1)
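
To make the effect concrete, here is a rough sketch in Python of how the number of map tasks falls out of splittability. It is a simplified model, not Hadoop's actual code: the function name `estimate_map_tasks`, the suffix table, and the 128 MiB default split size are assumptions for illustration, and real Hadoop decides splittability by checking the codec rather than the file name.

```python
# Simplified model of split planning; NOT Hadoop's real implementation.
# Assumption: splittability is looked up by file suffix for illustration only.
KNOWN_CODECS = {
    ".gz": False,   # gzip: cannot be split
    ".bz2": True,   # bzip2: can be split
}


def is_splittable(file_name: str) -> bool:
    """Return True if this sketch treats the file as splittable."""
    lower = file_name.lower()
    for suffix, splittable in KNOWN_CODECS.items():
        if lower.endswith(suffix):
            return splittable
    # No known compression suffix: assume an uncompressed, splittable file.
    return True


def estimate_map_tasks(file_name: str, file_size: int,
                       split_size: int = 128 * 1024 * 1024) -> int:
    """Estimate how many map tasks one input file would get."""
    if file_size <= 0 or split_size <= 0:
        raise ValueError("file_size and split_size must be positive")
    if not is_splittable(file_name):
        # The whole file goes to one mapper, regardless of its size.
        return 1
    # Ceiling division: one map task per split.
    return -(-file_size // split_size)


one_gib = 1024 ** 3
print(estimate_map_tasks("logs.gz", one_gib))   # 1
print(estimate_map_tasks("logs.bz2", one_gib))  # 8
```

So a 1 GiB gzip file still gets a single mapper, while the same data in a splittable format is spread over roughly one mapper per split.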

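Edit: Here is another article, this one on Snappy (2), a compression codec from Google. Note that a plain Snappy-compressed file is not splittable on its own either; it is normally used inside a splittable container format such as a SequenceFile.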

(1) http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/

(2) http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/