HBase MemStore and Garbage Collection


You are right about the HBase memstore. In general, when something is written to HBase, it is first written to an in-memory store (the memstore). Once this memstore reaches a certain size*, it is flushed to disk into a store file. Everything is also written immediately to a log file for durability.

*From a global perspective, HBase by default uses 40% of the heap (see the property hbase.regionserver.global.memstore.upperLimit) for all memstores of all regions of all column families of all tables. If this limit is reached, it starts flushing some memstores until the memory used by memstores drops below 35% of the heap (the lowerLimit property). Both values are adjustable, but you would need to work out the numbers carefully before changing them.
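To make the two percentages concrete, here is a minimal sketch of the arithmetic in plain Java. It does not call any HBase API. The class name, the method names and the 8 GiB heap are made up for illustration, and the 40% and 35% figures are simply the defaults quoted above, which you should check against your own HBase version.

```java
public class MemstoreWatermarks {
    // Defaults quoted in the answer above, expressed as whole percentages of the heap.
    // These are assumptions to verify against your HBase version, not values read from a cluster.
    public static final long UPPER_LIMIT_PERCENT = 40;
    public static final long LOWER_LIMIT_PERCENT = 35;

    /** Total memstore usage (in bytes) at which forced flushing starts. */
    public static long upperBytes(long heapBytes) {
        return heapBytes * UPPER_LIMIT_PERCENT / 100;
    }

    /** Total memstore usage (in bytes) that forced flushing tries to get back under. */
    public static long lowerBytes(long heapBytes) {
        return heapBytes * LOWER_LIMIT_PERCENT / 100;
    }

    public static void main(String[] args) {
        long heap = 8L * 1024 * 1024 * 1024; // hypothetical 8 GiB region server heap
        System.out.printf("forced flushing starts at about %d MiB and stops below about %d MiB%n",
                upperBytes(heap) >> 20, lowerBytes(heap) >> 20);
    }
}
```

The gap between the two values is the amount of memstore data a region server has to flush once the upper limit is hit, which is one reason not to change either number without doing this calculation first.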

Yes, GC does have an impact on the memstore, and you can modify this behavior by using the MemStore-Local Allocation Buffer (MSLAB). I would suggest reading the three-part article "Avoiding Full GCs in HBase with MemStore-Local Allocation Buffers", starting with part 1: http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
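The idea behind a memstore-local allocation buffer can be sketched roughly as follows. This is a simplified illustration in plain Java, not HBase's actual implementation: the class and method names are hypothetical, and the 2 MB chunk size and 256 KB cutoff are assumptions chosen for the example. Incoming values are copied into large fixed-size chunks owned by one memstore, so that a flush releases a few big blocks instead of many small scattered objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of a slab-style allocator for one memstore; not an HBase class. */
public class SlabSketch {
    public static final int CHUNK_SIZE = 2 * 1024 * 1024; // assumed chunk size
    public static final int MAX_ALLOC = 256 * 1024;       // assumed cutoff for slab allocation

    /** A view onto the region of a chunk that holds one copied value. */
    public static final class Slice {
        public final byte[] chunk;
        public final int offset;
        public final int length;

        Slice(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current;
    private int nextFree;

    /** Copies the value into the current chunk, starting a new chunk when it does not fit. */
    public Slice copyIn(byte[] value) {
        if (value.length > MAX_ALLOC) {
            // Large values get their own array instead of wasting most of a chunk.
            return new Slice(value.clone(), 0, value.length);
        }
        if (current == null || nextFree + value.length > CHUNK_SIZE) {
            current = new byte[CHUNK_SIZE];
            chunks.add(current);
            nextFree = 0;
        }
        System.arraycopy(value, 0, current, nextFree, value.length);
        Slice slice = new Slice(current, nextFree, value.length);
        nextFree += value.length;
        return slice;
    }

    public int chunkCount() {
        return chunks.size();
    }

    /** On flush, all chunks are dropped together, freeing memory in large contiguous pieces. */
    public void releaseOnFlush() {
        chunks.clear();
        current = null;
        nextFree = 0;
    }

    public static void main(String[] args) {
        SlabSketch slab = new SlabSketch();
        Slice a = slab.copyIn("row1:value".getBytes());
        Slice b = slab.copyIn("row2:value".getBytes());
        System.out.println("same chunk: " + (a.chunk == b.chunk) + ", chunks in use: " + slab.chunkCount());
        slab.releaseOnFlush();
    }
}
```

Because the caller's original small arrays become garbage right after the copy, they can die young, while the long-lived data sits in a handful of large chunks. The linked article explains how this reduces the heap fragmentation that eventually forces a full GC.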


The problem is that Java, as a technology, has trouble handling a server that creates and deletes a lot of objects and, at the same time, must respond to all requests in a timely manner. The root cause is the garbage collector, which sometimes has to perform a so-called "stop the world" pause to clean up memory. With large heaps this can cause delays of several seconds.
Now let's see why this happens to HBase and why it has to respond in a timely fashion.
The memstore is a cache of the region data. If the data is highly volatile, a lot of objects are created and deleted. As a result there is a lot of GC (garbage collector) pressure.
HBase, like any real-time system working with big data sets, tends to cache as much as possible, and therefore its MemStores are big.
HBase Region Servers have to communicate with ZooKeeper in a timely fashion to let it know they are alive and to avoid having their regions migrated. A long GC pause can prevent this.
What Cloudera did was implement its own memory management mechanism specifically for the MemStore to avoid GC pauses. Lars George, in his book, describes how to tune the GC to make it work better with the Region Server.
http://books.google.co.il/books?id=Ytbs4fLHDakC&pg=PA419&lpg=PA419&dq=MemStore+garbage+collector+HBASE&source=bl&ots=b-Sk-HV22E&sig=tFddqrJtlE_nIUI3VDMEyHdgx6o&hl=iw&sa=X&ei=79CyT82BIM_48QO_26ykCQ&ved=0CHUQ6AEwCQ#v=onepage&q=MemStore%20garbage%20collector%20HBASE&f=false