Why HBase is a better choice than Cassandra with Hadoop?

hadoop cassandra nosql hbase cap-theorem

I don't think either is better than the other, and it's not a matter of one or the other. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use cases. They can definitely complement each other in the same infrastructure.

To explain the difference better, I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What it says is basically that any distributed system has to find a balance between consistency, availability and partition tolerance, and that you can realistically satisfy only 2 of these properties. From that you can see that:

Cassandra satisfies the Availability and Partition Tolerance properties.
HBase satisfies the Consistency and Partition Tolerance properties.

[Image: CAP theorem diagram]
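To make the trade-off concrete, here is a deliberately simplified sketch of a two-replica store during a network partition. It is a toy model, not how either Cassandra or HBase is implemented; `ToyStore`, its `mode` flag and the `partitioned` switch are all hypothetical names invented for illustration. In "CP" mode the store rejects writes while partitioned so replicas never disagree; in "AP" mode it keeps accepting writes and the replicas can diverge until they are reconciled.

```python
class ToyStore:
    """Two in-memory replicas; mode is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, replica_id, key, value):
        """Return True if the write was accepted, False if it was rejected."""
        if not self.partitioned:
            # Healthy network: every replica receives the write.
            for replica in self.replicas:
                replica[key] = value
            return True
        if self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            return False
        # Availability first: accept locally; the other replica is now stale.
        self.replicas[replica_id][key] = value
        return True

    def read(self, replica_id, key):
        return self.replicas[replica_id].get(key)


ap = ToyStore("AP")
ap.write(0, "k", "v1")
ap.partitioned = True
ap.write(0, "k", "v2")   # accepted; replica 1 still holds "v1"

cp = ToyStore("CP")
cp.write(0, "k", "v1")
cp.partitioned = True
cp.write(0, "k", "v2")   # rejected; both replicas still hold "v1"
```

The AP store stays writable but two readers can see different values for a while; the CP store never shows conflicting values but turns writes away until the partition heals.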

When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which provides a standard enterprise distribution of Hadoop.

But Cassandra is also getting better integrated with Hadoop, notably through DataStax Brisk, which is gaining popularity. You can also now natively stream the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example). We are no longer at the point where Cassandra was just a standalone project.

In my experience, I've found that Cassandra is awesome for random reads, and not so great for scans.
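One plausible reason for this difference is how rows are placed on nodes. The sketch below assumes a hash-based partitioner on the Cassandra side and row keys kept in sorted order on the HBase side; it is a toy model, and `NODES`, `SPLIT_POINTS` and the function names are hypothetical, not part of either system's API. With hash placement a single-key lookup goes straight to one node, but adjacent keys are scattered, so a range scan cannot tell which nodes to skip. With sorted ranges, adjacent keys live together and a scan touches only the nodes whose ranges overlap it.

```python
import bisect
import hashlib

NODES = 4  # hypothetical cluster size

def hash_node(key):
    """Hash partitioning: the node is derived from a digest of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

# Range partitioning: each node owns one contiguous, sorted slice of the keys.
SPLIT_POINTS = ["g", "n", "t"]  # hypothetical boundaries, giving 4 ranges

def range_node(key):
    """Node whose sorted key range contains `key`."""
    return bisect.bisect_right(SPLIT_POINTS, key)

def nodes_for_scan(start, stop):
    """Nodes a scan over keys start..stop (inclusive) must touch
    when keys are range partitioned."""
    return list(range(range_node(start), range_node(stop) + 1))

keys = ["user01", "user02", "user03", "user04", "user05"]

# Point lookups: either scheme resolves one key to exactly one node.
print({k: hash_node(k) for k in keys})

# Range scan: sorted placement keeps these neighbouring keys on one node,
# while hash placement gives no such guarantee and every node must be asked.
print(nodes_for_scan("user01", "user05"))
```

In this model a scan over `user01`..`user05` stays on a single node under range partitioning, whereas under hash partitioning the same keys may land anywhere in the cluster.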

To add a little color to the picture, I've been using both at my job in the same infrastructure, and HBase serves a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs where latency requirements are less strict.

This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the 2 systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.


We have to compare the pros & cons of both databases and make a careful decision depending on business requirements.

Cassandra

Pros:

Satisfies Availability & Partition Tolerance of the CAP theorem, and is eventually consistent
Scales to large clusters with no single point of failure
SQL-like language for development lets developers transition easily from an RDBMS background
Excellent single-row read performance, as long as eventual consistency semantics are sufficient for the use case
Support from DataStax is a big advantage
Optimized for writes

Cons:

Does not support range-based row scans
Does not support atomic compare-and-set
Does not support co-processor functionality
Supports secondary indexes only on column families where the column name is known (not on dynamic columns)
Aggregations are not performed by the Cassandra nodes themselves

HBase

Pros:

Strong consistency; meets Consistency & Partition Tolerance of the CAP theorem
RDBMS-equivalent triggers & stored procedures (via co-processors)
Hadoop support
Range-based row scans
Supports atomic compare-and-set
Optimized for reads, supported by a single-write master
Support for aggregation
High scalability & automatic data sharding

Cons:

Lacks a friendly language for development
Does not support read load balancing against a single row
Inter-row operations are not atomic
Single point of failure if only one HBase Master is used
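For readers unfamiliar with the atomic compare-and-set mentioned in the lists above, here is a minimal sketch of the idea: a write is applied only if the current value still matches what the caller expects, and the check and the write happen as one indivisible step. This is a single-process toy guarded by a lock; `ToyRow` and `check_and_put` are hypothetical names, not the HBase client API.

```python
import threading

class ToyRow:
    """A single cell with an atomic check-and-put (illustrative only)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def check_and_put(self, expected, new):
        """Write `new` only if the current value equals `expected`.

        The comparison and the write run under the same lock, so no other
        writer can slip in between them. Returns True if the write happened.
        """
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


row = ToyRow("v1")
row.check_and_put("v1", "v2")  # succeeds: the cell held "v1"
row.check_and_put("v1", "v3")  # fails: the cell now holds "v2"
```

The second call is rejected because another writer (here, the first call) changed the value in the meantime, which is exactly the lost-update case compare-and-set is meant to catch.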

Have a look at article 1, article 2 and this presentation for further details.
