Can you use HDFS as your principal storage?


HDFS is only as reliable as the NameNode(s) that maintain the file metadata. You should set up NameNode HA, take frequent snapshots of the NameNode metadata, and store those copies outside HDFS.

If all NameNodes are unavailable, or their metadata storage is corrupted, you will be unable to read the data held on the DataNodes, even though those blocks are themselves intact and highly available.
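
As a rough illustration of storing metadata copies outside HDFS, here is a minimal Python sketch that wraps the `hdfs dfsadmin -fetchImage` command, which downloads the most recent fsimage from the NameNode into a local directory. The helper names and the timestamped directory layout are assumptions made for this example, not an established convention, and a fetched fsimage may not include transactions made after the last checkpoint.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def timestamped_dir(backup_root: Path, now: datetime) -> Path:
    """Return a per-run backup directory such as <root>/20240102T030405Z."""
    return backup_root / now.strftime("%Y%m%dT%H%M%SZ")


def fetch_image_command(backup_dir: Path) -> List[str]:
    """Build the command that downloads the latest fsimage from the NameNode."""
    return ["hdfs", "dfsadmin", "-fetchImage", str(backup_dir)]


def backup_namenode_metadata(backup_root: Path, now: Optional[datetime] = None) -> Path:
    """Fetch the current fsimage into a timestamped directory under backup_root.

    backup_root is assumed to live on storage outside the HDFS cluster.
    """
    backup_dir = timestamped_dir(backup_root, now or datetime.now(timezone.utc))
    backup_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the hdfs command exits with an error.
    subprocess.run(fetch_image_command(backup_dir), check=True)
    return backup_dir
```

Calling `backup_namenode_metadata(Path("/mnt/offcluster/namenode"))` from a scheduler would produce one directory per run; the mount path here is hypothetical.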


Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).

Hive:

  1. HDFS is a filesystem that supports failover and HA. It replicates data across several DataNodes according to the replication factor you choose. Hive is built on top of Hadoop, so it can store its data in HDFS and inherit those HA benefits.
  2. Hive uses predicate pushdown, which can provide large performance benefits. It can also be combined with columnar file formats such as Parquet and ORC, which improve performance further (again through predicate pushdown).
  3. Hive provides easy access to data via HQL (Hive Query Language), a SQL-like language.
  4. Hive works well with Spark, and you can combine the two: load Hive data into DataFrames and save DataFrames back to Hive.
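
To make the Spark point concrete, here is a minimal PySpark sketch of the round trip: read a Hive table into a DataFrame, filter it, and save the result as another Hive table. It assumes `pyspark` is installed and a Hive metastore is configured; the function names, table names, and filter condition are hypothetical.

```python
def build_session(app_name="hive-roundtrip"):
    """Create a SparkSession with Hive support (needs pyspark and a configured metastore)."""
    # Imported lazily so copy_filtered below has no hard dependency on pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()


def copy_filtered(spark, source_table, target_table, condition):
    """Read a Hive table into a DataFrame, filter it, and save it as a Hive table."""
    df = spark.table(source_table)
    # With columnar formats such as ORC or Parquet, a simple filter like this
    # is a candidate for predicate pushdown in the reader.
    filtered = df.filter(condition)
    # Overwrites target_table if it already exists.
    filtered.write.mode("overwrite").format("orc").saveAsTable(target_table)
    return filtered
```

A call might look like `copy_filtered(build_session(), "sales.events", "sales.events_2024", "year = 2024")`, where both table names are placeholders.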

HDFS/HBase:

  1. Hive is a warehouse system built for data analysis, so its CRUD operations are slower than direct access to HDFS files (or to HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it to Hive. If you need fast writes (or insert queries) and do not analyze large datasets, prefer HDFS/HBase over Hive.
  2. Performance is crucial for your application, so you prefer to skip the extra Hive layer and access HDFS files directly.
  3. The team decides not to use SQL.

Related post:

When to use Hadoop, HBase, Hive and Pig?