Large Data Sets - NoSQL, NewSQL, SQL..? Brain Fried


You need to think carefully about what types of queries you will need to run over these docs. Cassandra and similar stores may well be a good fit if your queries are basic, but richer SQL-like queries are not possible. The largest Cassandra deployments are of the order of 150TB, so your data volumes should not be a problem; but Cassandra's performance may be overkill for your needs, and you would be sacrificing query richness to get it.
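To make the "basic queries vs. rich queries" trade-off concrete, here is a rough sketch in plain Python. It is not Cassandra's API; `KeyedStore` and its methods are hypothetical names standing in for a store partitioned by key. Lookups by partition key touch one partition, while an ad hoc filter on other fields has to scan everything.

```python
from collections import defaultdict


class KeyedStore:
    """Hypothetical in-memory stand-in for a store partitioned by key."""

    def __init__(self):
        # partition_key -> {doc_id -> document}
        self._partitions = defaultdict(dict)

    def put(self, partition_key, doc_id, doc):
        self._partitions[partition_key][doc_id] = doc

    def get_partition(self, partition_key):
        # Cheap: only the one partition named by the key is read.
        return dict(self._partitions.get(partition_key, {}))

    def scan_where(self, predicate):
        # Expensive: every document in every partition is inspected.
        return [
            doc
            for partition in self._partitions.values()
            for doc in partition.values()
            if predicate(doc)
        ]


store = KeyedStore()
store.put("customer-1", "doc-a", {"type": "invoice", "total": 120})
store.put("customer-1", "doc-b", {"type": "receipt", "total": 40})
store.put("customer-2", "doc-c", {"type": "invoice", "total": 300})

# The kind of query such a store is designed around.
by_key = store.get_partition("customer-1")

# The kind of query that SQL makes easy but that needs a full scan here.
large_invoices = store.scan_where(
    lambda d: d["type"] == "invoice" and d["total"] > 100
)
```

If most of your queries look like the second one, that is a sign a key-oriented store alone will not be enough.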

If you just want text indexing, then also consider Lucene. I think Lucene can now handle over 100 GB/hour for batch indexing, so overnight indexing of 1TB would be possible, and Lucene now claims comparable speeds for incremental indexing too.
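The overnight estimate is simple arithmetic, sketched below. The 100 GB/hour figure is the rough number quoted above, not a measured benchmark, and 1TB is taken as 1000 GB; `indexing_hours` is just an illustrative helper.

```python
def indexing_hours(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Return the hours needed to index data_gb at a steady throughput."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / throughput_gb_per_hour


ONE_TB_IN_GB = 1000  # assumption: decimal terabyte

# Assumed throughput of 100 GB/hour, as quoted in the answer.
hours = indexing_hours(ONE_TB_IN_GB, 100)
print(f"{hours:.1f} hours")  # 10.0 hours
```

Ten hours fits in an overnight window, but only if the real throughput on your hardware and documents is close to the quoted figure.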


Check out RavenDB. It is a document DB that supports Map/Reduce and is based on Lucene, so it can also provide full-text search capabilities natively from the querying API.
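If Map/Reduce over documents is unfamiliar, the idea can be sketched in plain Python. This is not RavenDB's API (its indexes are defined differently); `map_doc`, `reduce_counts` and `map_reduce` are hypothetical names, and the documents are made up for illustration.

```python
from collections import defaultdict


def map_doc(doc):
    # Map step: emit a (key, value) pair for every word in the document.
    for word in doc["body"].lower().split():
        yield word, 1


def reduce_counts(key, values):
    # Reduce step: collapse all values emitted for one key into a total.
    return key, sum(values)


def map_reduce(docs):
    # Group the mapped pairs by key, then reduce each group.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


docs = [
    {"id": 1, "body": "large data sets"},
    {"id": 2, "body": "Large text data"},
]
counts = map_reduce(docs)
print(counts)
```

A real document DB runs the map and reduce steps incrementally and across nodes, which is what makes the approach useful at large data volumes.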

Sharding and replication capabilities are built-in, and very advanced. Using Esent as storage, each node can store up to 16TB of data.


The right database mainly depends on your use cases. I would suggest going with Cassandra or HBase.

For real-time analysis over Cassandra you can use Apache Spark and Spark Streaming; both work well.

Also try Elasticsearch or Solr for text searching. All of these are open source and well worth trying.
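Both of those search engines are built on Lucene, whose core structure is an inverted index. Here is a toy version in plain Python to show what that means; it is only a sketch with hypothetical helper names (`build_index`, `search`) and ignores everything real engines add, such as analyzers, scoring and on-disk segments.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Cassandra stores large data sets",
    2: "Lucene indexes large text collections",
    3: "Solr and Elasticsearch build on Lucene",
}
index = build_index(docs)

print(search(index, "lucene"))        # documents 2 and 3
print(search(index, "large lucene"))  # document 2 only
```

The point is that a term lookup goes straight to the matching documents instead of scanning all of them, which is why this approach holds up on large text collections.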

For real-time analysis you can also have a look at Facebook's open-source Presto (PrestoDB), but I didn't find much information on it apart from the Presto website, and most people suggest going with Cassandra plus Apache Spark.