Massive Database w/ Fulltext Search - Sphinx, Lucene, Cassandra, MongoDB, CouchDB [closed]


Storing data and searching it are two different things. If you look at architectures like eBay's, they run separate services and servers for search. 50M rows is nothing; any of these datastores can hold it. None of them is perfect, so the difference comes down to use case. For example:

Cassandra has the fastest insert performance of these at any data size, scales to petabytes across hundreds of machines easily (no manual sharding needed), has Lucandra (a Cassandra-Lucene integration that scales well with massive data but is a toy compared to elasticsearch), and offers high durability.

MongoDB has more query options (it uses B-tree indexes, like a traditional DBMS), recently gained autosharding, and can index any field, but has poor durability.

PostgreSQL is the most advanced open source DBMS out there, recently gained built-in master/slave replication, can scale by sharding, and is ACID and SQL compliant.

CouchDB, I think, has no advantage over the others in any use case. It's damn slow, and if I needed ACID I would probably use PostgreSQL.

The built-in full-text search in these datastores has some problems and does not scale.

The most advanced open source search engine (massive data, high performance, simple, distributed, fault tolerant, REST API) is elasticsearch; you can think of it as distributed Lucene. Solr is legacy compared to elasticsearch. Using raw Lucene or Sphinx on its own does not scale.

If I were you, I would probably choose one of the datastores, use elasticsearch for indexing, and synchronize the two in my data access layer (the index has to be updated on every DB insert/update/delete).
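To make the synchronization idea concrete, here is a minimal sketch of such a data access layer. Everything in it is hypothetical: `SyncedProductStore` and `InMemoryIndex` are names made up for illustration, a plain dict stands in for the datastore, and the toy index stands in for a real search engine client. It only shows the shape of "write to the store, then update the index"; it is not how any particular client library works.

```python
class InMemoryIndex:
    """Toy stand-in for a search engine client (hypothetical, for illustration)."""

    def __init__(self):
        self.docs = {}  # doc_id -> set of lowercase tokens

    def index(self, doc_id, doc):
        # Re-indexing an existing id overwrites its old tokens.
        text = " ".join(str(value) for value in doc.values())
        self.docs[doc_id] = set(text.lower().split())

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def search(self, term):
        term = term.lower()
        return sorted(i for i, tokens in self.docs.items() if term in tokens)


class SyncedProductStore:
    """Hypothetical data access layer: every write goes to both store and index."""

    def __init__(self, store, index):
        self.store = store  # stand-in for the primary datastore (a dict here)
        self.index = index  # stand-in for the search index

    def save(self, product_id, doc):
        # Covers both insert and update: write the record, then refresh the index.
        self.store[product_id] = doc
        self.index.index(product_id, doc)

    def delete(self, product_id):
        # Remove the record, then remove it from the index as well.
        self.store.pop(product_id, None)
        self.index.delete(product_id)


dal = SyncedProductStore({}, InMemoryIndex())
dal.save(1, {"name": "Red Widget"})
dal.save(2, {"name": "Blue widget"})
print(dal.index.search("widget"))
```

In a real system the two writes are not atomic, so you would also need a plan for the case where the datastore write succeeds and the index update fails (a retry queue or a periodic reindex, for example).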

Regards


Paul, welcome to SO. This isn't really the right place to try to get someone to work for you, but here's my advice:

Truthfully, depending on the types of searches you are doing, writing MySQL off may be a bit premature.

Since it's product data, I'd imagine your searches are full-text searches, in which case writing off MySQL isn't premature. Sphinx is great but a bit of a pain to configure. The benefit is that it can index directly from MySQL, and you can also talk to it with whatever MySQL connector/bindings your application already uses, because it speaks MySQL's protocol.

I'd say Cassandra, Couch, and Mongo are not really what you are looking for; none of them natively indexes text the way Sphinx does. You could roll your own on top of them, but it would be pretty counterproductive.
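For a sense of what "rolling your own" would involve, here is a minimal sketch of the core structure a full-text engine maintains, an inverted index mapping each token to the documents that contain it. The `InvertedIndex` class and its methods are hypothetical names for illustration, and it is in-memory only; nothing here reflects how Sphinx or Lucene is actually implemented.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: token -> set of document ids (illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-word characters; no stemming or stop words.
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, text):
        for token in self.tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query token.
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        id_sets = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*id_sets)


idx = InvertedIndex()
idx.add(1, "Red cotton shirt")
idx.add(2, "Blue cotton socks")
print(idx.search("red cotton"))
```

Even this toy leaves out relevance ranking, stemming, phrase queries, updates and deletes, and persistence, which is most of what a dedicated search engine gives you for free.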

I've never worked with Lucene, but I've heard good things; it's a similar solution to Sphinx, afaik.

Good luck.