Multiple rows insertion in HBase using MapReduce


I prefer the second option, where batching comes naturally with MapReduce (no need to build a List<Put> yourself). For a deeper look, see my second point below.

1) Your first option, List<Put>, is generally used with a standalone HBase Java client. Internally the buffering is controlled by hbase.client.write.buffer, set like below in one of your config XMLs:

<property>
  <name>hbase.client.write.buffer</name>
  <value>20971520</value> <!-- 20 MB in this example; the default is 2097152 (2 MB) -->
</property>

Once the buffer is filled, all buffered Puts are flushed and actually inserted into your table. This is the same mechanism BufferedMutator uses, as explained in #2.
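For illustration, here is a minimal sketch of that standalone-client pattern; the table name, column family, and values are placeholders, not taken from the question:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StandalonePutClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("my_table"))) {
      List<Put> puts = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
        puts.add(put);
      }
      // The whole list is handed over in one call; the client groups the
      // Puts by region server and sends them in batches.
      table.put(puts);
    }
  }
}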

2) Regarding the second option, look at the TableOutputFormat documentation:

org.apache.hadoop.hbase.mapreduce
Class TableOutputFormat<KEY>

java.lang.Object
  org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
    org.apache.hadoop.hbase.mapreduce.TableOutputFormat<KEY>

All Implemented Interfaces: org.apache.hadoop.conf.Configurable

@InterfaceAudience.Public
@InterfaceStability.Stable
public class TableOutputFormat<KEY>
extends org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
implements org.apache.hadoop.conf.Configurable

Convert Map/Reduce output and write it to an HBase table. The KEY is ignored while the output value must be either a Put or a Delete instance.

Another way of seeing this is through the code itself:

/**
 * Writes a key/value pair into the table.
 *
 * @param key  The key.
 * @param value  The value.
 * @throws IOException When writing fails.
 * @see RecordWriter#write(Object, Object)
 */
@Override
public void write(KEY key, Mutation value) throws IOException {
  if (!(value instanceof Put) && !(value instanceof Delete)) {
    throw new IOException("Pass a Delete or a Put");
  }
  mutator.mutate(value);
}

Conclusion: context.write(rowkey, putList) is not possible with this API; each write call takes a single Put or Delete.
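The usual pattern is therefore to call context.write(...) once per Put and let the output format do the batching. A minimal mapper sketch, assuming text input lines of the form "rowkey,value" and placeholder column names:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HBaseInsertMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Placeholder input format: each line is "rowkey,value".
    String[] parts = value.toString().split(",", 2);
    byte[] rowKey = Bytes.toBytes(parts[0]);

    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));

    // One Put per write; TableOutputFormat's RecordWriter feeds each one
    // into a BufferedMutator, which batches the actual RPCs.
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}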

However, the BufferedMutator behind mutator.mutate(...) in the write() method shown earlier says:

Map/reduce jobs benefit from batching, but have no natural flush point. {@code BufferedMutator} receives the puts from the M/R job and will batch puts based on some heuristic, such as the accumulated size of the puts, and submit batches of puts asynchronously so that the M/R logic can continue without interruption.

So, with BufferedMutator, your batching is natural, as mentioned at the beginning.
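For completeness, here is a sketch of a map-only driver wiring the mapper above to TableOutputFormat (the table name and input path are placeholders); the framework's RecordWriter then hands every Put to the BufferedMutator described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class HBaseInsertJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Placeholder table name; TableOutputFormat reads it from this property.
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table");

    Job job = Job.getInstance(conf, "hbase-multi-row-insert");
    job.setJarByClass(HBaseInsertJob.class);
    job.setMapperClass(HBaseInsertMapper.class);
    job.setNumReduceTasks(0); // map-only: Puts go straight to the output format

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // placeholder input path

    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}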