Can you use HDFS as your principal storage?

hadoop hdfs storage data-lake

HDFS is only as reliable as the Namenode(s) that maintain the file metadata. You'd better setup Namenode HA and take frequent snapshots of them, and externally store those away from HDFS.

If all Namenodes are unavailable, or their metadata storage is corrupted, you'll be unable to read the HDFS datanode data, despite those files being fine themselves, and highly available

hadoop hdfs storage data-lake

Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).

Hive:

HDFS is a filesystem that supports fail-over and HA. HDFS will replicate the data in several datanodes based on the replication factor you have chosen. Hive is build on top of Hadoop therefore can store data in HDFS as well leveraging the pros of HDFS for HA.
Hive utilizes predicates-pushdown providing huge performance benefits. Hive can also be combined with modern file formats such as parquet and ORC improving performance even more (utilizing predicates-pushdown).
Hive provides very easy access to data via HQL (Hive Query Language) which is SQL like language.
Hive works very well with Spark and you can combine them both aka retrieving Hive data into dataframes and saving dataframes into Hive.

HDFS/HBase:

Hive is a warehouse system used for data analysis therefore Hive CRUD operations are relatively slower than direct access to HDFS files (or HBase which is build for fast CRUD operations). For instance in a streaming application saving data in HDFS or HBase will be much faster than in Hive. If you need fast storage (or insert queries) and you don't do any analysis on large datasets then you should prefer HDFS/HBase over Hive.
If performance is very crucial for your application and therefore you prefer to skip the extra layer of Hive accessing HDFS files directly.
The team decides not to use SQL.

CodeHunter

Can you use HDFS as your principal storage?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last