Why HBase is a better choice than Cassandra with Hadoop?


I don't think either is better than the other; it's not a matter of picking just one. These are very different systems, each with its own strengths and weaknesses, so it really depends on your use case. They can definitely complement one another in the same infrastructure.

To explain the difference better, I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What they say is basically that any distributed system has to find a balance between consistency, availability, and partition tolerance, and you can only realistically satisfy two of these properties at once. From that you can see that:

  • Cassandra satisfies the Availability and Partition Tolerance properties.
  • HBase satisfies the Consistency and Partition Tolerance properties.

[Figure: CAP theorem diagram]
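To make the trade-off a bit more concrete, here is a toy sketch of the two behaviours during a network partition. It is only an illustration of the CAP idea under my own simplifying assumptions (two replicas, a `partitioned` flag); the class names `APStore` and `CPStore` are made up and this is not how either Cassandra or HBase is actually implemented.

```python
class Replica:
    def __init__(self):
        self.value = None


class APStore:
    """Always answers; replicas may disagree while partitioned."""

    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # hypothetical switch simulating a network split

    def write(self, replica_id, value):
        # Accept the write locally even if the peer is unreachable.
        self.replicas[replica_id].value = value
        if not self.partitioned:
            for replica in self.replicas:
                replica.value = value
        return True

    def read(self, replica_id):
        # May return stale data during a partition.
        return self.replicas[replica_id].value


class CPStore:
    """Never serves diverging data; rejects writes it cannot replicate."""

    def __init__(self):
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # hypothetical switch simulating a network split

    def write(self, replica_id, value):
        if self.partitioned:
            raise RuntimeError("unavailable: cannot reach all replicas")
        for replica in self.replicas:
            replica.value = value
        return True

    def read(self, replica_id):
        return self.replicas[replica_id].value
```

In this sketch, a write to `APStore` during a partition succeeds but the other replica keeps serving the old value, while `CPStore` refuses the write so both replicas keep agreeing.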

When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which is a standard enterprise distribution for Hadoop.

But Cassandra also has more integration with Hadoop these days, namely DataStax Brisk, which is gaining popularity. You can also now natively stream the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example). We are no longer at the point where Cassandra was just a standalone project.

In my experience, Cassandra is awesome for random reads and not so much for scans.
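One way to see why this happens is to compare how row keys are placed on nodes. The sketch below is a simplified model, assuming hash-based placement (as with Cassandra's random partitioner) versus contiguous sorted key ranges (as with HBase regions). The node count, key names, and split points are all made up for illustration.

```python
import bisect
import hashlib

NODES = 4
KEYS = [f"user{i:04d}" for i in range(1000)]

# Hypothetical split points: each node owns one contiguous key range.
SPLITS = ["user0250", "user0500", "user0750"]


def hash_node(key):
    # Hash partitioning: placement ignores key order.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES


def range_node(key):
    # Sorted partitioning: neighbouring keys live on the same node.
    return bisect.bisect_right(SPLITS, key)


def nodes_for_scan(start, stop, locate):
    """Return the set of nodes holding at least one key in [start, stop)."""
    return {locate(k) for k in KEYS if start <= k < stop}
```

A point lookup touches exactly one node in both schemes. A scan such as `nodes_for_scan("user0100", "user0200", range_node)` stays on a single node with sorted ranges, whereas the same scan with `hash_node` is spread over several nodes because adjacent keys hash to unrelated places.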

To add a little color to the picture: I've been using both at my job in the same infrastructure, and HBase serves a very different purpose than Cassandra. I've used Cassandra mostly for very fast real-time lookups, while I've used HBase more for heavy ETL batch jobs with less strict latency requirements.

This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the two systems. Bottom line is, there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.


We have to compare the pros and cons of both databases and make a careful decision depending on business requirements.

Cassandra

Pros:

  1. Satisfies the Availability and Partition Tolerance properties of the CAP theorem and is eventually consistent
  2. Scales to large clusters with no single point of failure
  3. SQL-like language for development, which lets developers transition easily from an RDBMS background
  4. Excellent single-row read performance, as long as eventual consistency semantics are sufficient for the use case
  5. Support from DataStax is a big advantage
  6. Optimized for writes

Cons:

  1. Does not support range-based row scans
  2. Does not support atomic compare-and-set
  3. Does not support coprocessor functionality
  4. Supports secondary indexes only on column families where the column name is known (not on dynamic columns)
  5. Aggregations are not supported on the Cassandra nodes themselves

HBase

Pros:

  1. Strong consistency; satisfies the Consistency and Partition Tolerance properties of the CAP theorem
  2. Equivalents of RDBMS triggers and stored procedures (via coprocessors)
  3. Hadoop support
  4. Range-based row scans
  5. Supports atomic compare-and-set
  6. Optimized for reads, supported by a single-write master
  7. Support for aggregation
  8. High scalability and automatic data sharding
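For readers unfamiliar with the term, atomic compare-and-set means "write the new value only if the current value is what I expect", with the check and the write happening as a single step. HBase exposes an operation of this kind in its client API (checkAndPut). The sketch below is just an in-memory illustration of the semantics, with a made-up `Table` class and method names, not the HBase API itself.

```python
import threading


class Table:
    """Hypothetical in-memory table used only to illustrate compare-and-set."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, row):
        return self._rows.get(row)

    def check_and_put(self, row, expected, new_value):
        """Write new_value only if the row currently holds expected."""
        with self._lock:  # the check and the write form one atomic step
            if self._rows.get(row) != expected:
                return False
            self._rows[row] = new_value
            return True
```

With this, two clients racing to initialise the same row can both call `check_and_put(row, None, value)`. Only one call returns `True`, and the loser knows it has to re-read before retrying.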

Cons:

  1. Lacks a friendly language for development
  2. Does not support read load balancing against a single row
  3. Inter-row operations are not atomic
  4. Single point of failure if only one HBase Master is used

Have a look at article 1, article 2, and this presentation for further details.