Spark-Cassandra Vs Spark-Elasticsearch Spark-Cassandra Vs Spark-Elasticsearch elasticsearch elasticsearch

Spark-Cassandra Vs Spark-Elasticsearch


The short exact answer is "it depends", mostly on cluster sizes =)

I wouldn't chose Elastisearch as a primary source for the data, because it's good at searching. Searching is a very specific task and it requires a very specific approach, which in this cases uses inverted index to store actual data. Each field basically goes into separate index and because of that the indexes are very compact. Although it's possible to store into index complete objects, such an index will hardly get any benefit of compression. That requires much more disk space to store indexes and much more cpu clocks, spinning disks to process them.

Cassandra on the other hand is pretty good at storing and retrieving data.

Without any more or less specific requirements, I'd say that Cassandra is good at being primary storage (and provides pretty simple search scenarios) and ES is good at searching.


I will refute Evgenii answer about how ES is only good at searching.YES ES exceed at text search but it doesnt mean it can't do data.

You can actually treat it as if it was "Mongo" style Documentation and run "filter" queries on it to have fast fetch results. HOWEVER the question now becomes: how fast do you need your read/write and do you need any distributions? What ES lacks is distribution. Yes ES can do sharding but it has issues doing multi region distribution and reliability of replication of your data.

If you need the flexibility / reliability of your data I would swing to Cassanda. Also since you are dealing with TB - Cassandra might be a winner too because it is fitted for extreme volume.

If you need an easier time to to run searches (not limited to text search, eg: geo spacial you can do too) then ES might be a better fit. (note for the shear volume you are doing, you will need to shard in order to distribute your load).