When using HBase as a source for MapReduce, can I extend TableInputFormatBase to create multiple splits and multiple mappers for each region? When using HBase as a source for MapReduce, can I extend TableInputFormatBase to create multiple splits and multiple mappers for each region? hadoop hadoop

When using HBase as a source for MapReduce, can I extend TableInputFormatBase to create multiple splits and multiple mappers for each region?


You need a custom input format that extends InputFormat. you can get idea how do this from answer to question I want to scan lots of data (Range based queries), what all optimizations I can do while writing the data so that scan becomes faster. This is a good idea if the time of data processing is more greater then data retrieving time.


Not sure if you can specify multiple mappers for a given region, but consider the following:

If you think one mapper is inefficient per region (maybe your data nodes don't have enough resources like #cpus), you can perhaps specify smaller regions sizes in the file hbase-site.xml.

here's a site for the default configs options if you want to look into changing that:http://hbase.apache.org/configuration.html#hbase_default_configurations

please note that by making the region size small, you will be increasing the number of files in your DFS, and this can limit the capacity of your hadoop DFS depending on the memory of your namenode. Remember, the namenode's memory usage is directly related to the number of files in your DFS. This may or may not be relavant to your situation as I do not know how your cluster is being used. There is never a silver bullet answer to these questions!


1 . Its absolutely fine just make sure the key set is mutually exclusive between the mappers .

  1. you arent creating too many clients as this may lead to lot of gc , as during hbase read hbase block cache churning happens