Hadoop on cassandra database


If you are interested in marrying Hadoop and Cassandra, the first link should be DataStax, a company built around this concept: http://www.datastax.com/
They build and support a Hadoop distribution in which HDFS is replaced with Cassandra. To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/

There is a good answer about Hadoop and Cassandra data locality when you run MapReduce against Cassandra: "Cassandra and MapReduce - minimal setup requirements".

Regarding your question - there is a tradeoff:
a) If you run Hadoop / Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop / Hive on the same nodes as Cassandra, you get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and therefore degrade the quality of service you get from Cassandra.
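
To get a rough feel for option (a) versus option (b), here is a back-of-the-envelope sketch. All of the numbers (data size, node count, disk and network throughput) are made-up assumptions for illustration, not measurements, and the model ignores replication, compaction, and CPU cost. It only shows how a full scan becomes network-bound once the readers are no longer on the nodes that hold the data.

```python
def scan_seconds(data_gb, readers, disk_mb_s, network_mb_s=None):
    """Estimate the time for `readers` parallel tasks to scan `data_gb` of data.

    If network_mb_s is given, each reader pulls data over the network, so its
    rate is capped by the slower of disk and network. If it is None, reads are
    assumed to be local and only the disk rate applies.
    """
    per_reader = disk_mb_s if network_mb_s is None else min(disk_mb_s, network_mb_s)
    total_mb = data_gb * 1024
    return total_mb / (readers * per_reader)


# Hypothetical figures, chosen only for illustration.
DATA_GB = 1024        # 1 TB stored in Cassandra
READERS = 10          # parallel map tasks / nodes
DISK_MB_S = 300       # assumed sequential read rate per node
NETWORK_MB_S = 110    # assumed usable rate of a 1 Gbit/s link

remote = scan_seconds(DATA_GB, READERS, DISK_MB_S, NETWORK_MB_S)  # option (a)
local = scan_seconds(DATA_GB, READERS, DISK_MB_S)                 # option (b)

print(f"separate Hive nodes (network-bound): {remote / 60:.1f} min")
print(f"co-located with Cassandra (disk-bound): {local / 60:.1f} min")
```

The flip side, which this sketch does not model, is that in option (b) the same disks and CPUs are also serving Cassandra's regular reads and writes while the scan runs.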

My suggestion would be to use separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.