Storing data in HBase vs Parquet files Storing data in HBase vs Parquet files hadoop hadoop

Storing data in HBase vs Parquet files


As you have already said in question, parquet is a storage while HBase is storage(HDFS) + Query Engine(API/shell) So a valid comparison should be done between parquet+Impala/Hive/Spark and HBase. Below are the key differences -

1) Disk space - Parquet takes less disk space in comparison to HBase. Parquet encoding saves more space than block compression in HBase.

2) Data Ingestion - Data ingestion in parquet is more efficient than HBase. A simple reason could be point 1. As in case of parquet, less data needs to be written on disk.

3) Record lookup on key - HBase is faster as this is a key-value storage while parquet is not. Indexing in parquet will be supported in future release.

4) Filter and other Scan queries - Since parquet store more information about records stored in a row group, it can skip lot of records while scanning the data. This is the reason, it's faster than HBase.

5) Updating records - HBase provides record updates while this may be problematic in parquet as the parquet files needs to be re-written. A careful design of schema and partitioning may improve updates but it's not comparable with HBase.

By comparing the above features, HBase seems more suitable for situations where updates are required and queries involve mainly key-value lookup. Query involving key range scan will also have better performance in HBase.

Parquet is suitable for use cases where updates are very few and queries involves filters, joins and aggregations.