Searching across shards? Searching across shards? database database

Searching across shards?


There is no magic bullet.

Searching each shard in succession is out of the question, obviously, due to the incredibly high latency you will incur.

So you want to search in parallel, if you have to.

There are two realistic options, and you already listed them -- indexing, and parallelized search. Allow me to go into a little more detail on how you would go about designing them.

The key insight you can use is that in search, you rarely need the complete set of results. You only need the first (or nth) page of results. So there is quite a bit of wiggle room you can use to decrease response time.

Indexing

If you know the attributes on which the users will be searched, you can create custom, separate indexes for them. You can build your own inverted index, which will point to the (shard, recordId) tuple for each search term, or you can store it in the database. Update it lazily, and asynchronously. I do not know your application requirements, it might even be possible to just rebuild the index every night (meaning you will not have the most recent entries on any given day -- but that might be ok for you). Make sure to optimize this index for size so it can fit in memory; note that you can shard this index, if you need to.

Naturally, if people can search for something like "lastname='Smith' OR lastname='Jones'", you can read the index for Smith, read the index for Jones, and compute the union -- you do not need to store all possible queries, just their building parts.

Parallel Search

For every query, send off requests to every shard unless you know which shard to look for because the search happens to be on the distribution key. Make the requests asynchronous. Reply to the user as soon as you get the first page-worth of results; collect the rest and cache locally, so that if the user hits "next" you will have the results ready and do not need to re-query the servers. This way, if some of the servers are taking longer than others, you do not need to wait on them to service the request.

While you are at it, log the response times of the sharded servers to observe potential problems with uneven data and/or load distribution.


I'm assuming you are talking about shards a la :http://highscalability.com/unorthodox-approach-database-design-coming-shard

If you read that article he goes into some detail on exactly your question, but long answer short, you write custom application code to bring your disparate shards together. You can do some smart hashing to both query individual shards and insert data into shards. You need to ask a more specific question to get a more specific answer.


You actually do need every search to hit every shard, or at least every search needs to be performed against an index that contains the data from all shards, which boils down to the same thing.

Presumably you shard based on a single property of the user, probably a hash of the username. If your search feature allows the user to search based on other properties of the user it is clear that there is no single shard or subset of shards that can satisfy a query, because any shard could contain users that match the query. You can't rule out any shards before performing the search, which implies that you must run the query against all shards.