
ElasticSearch setup for a large cluster with heavy aggregations


Let me preface all of my answers/comments with the advice to test these scenarios yourself as much as possible. While Elasticsearch is very scalable, there are many tradeoffs that are heavily impacted by document size and type, ingestion and query volume, hardware, and OS. While there are many wrong answers, there is rarely one right answer.

I'm basing this response on a couple of active clusters we have, with (currently) about half a million active documents in them, plus some recent benchmarking we performed at about 4x your volume (around 80M documents ingested per day during the benchmark).

1) First off, you are not creating much of a hot spot with 3 nodes when you have even a single index with 5 shards and 1 replica per shard. Elasticsearch will place each replica on a different node from its primary, and in general will try to balance out the load of shards. By default, Elasticsearch hashes on the document ID to pick the shard to index into (the document then gets copied to the replica). Even with routing, you will only have a hot spot issue if you have single IDs that create large numbers of documents per day (which is the span of your index). Even then, it would not be a problem unless those IDs produce a significant percentage of the overall volume AND there are so few of them that you could get clumping on just 1 or 2 of the shards.
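
For illustration, here's a rough sketch of the two indexing modes using the Python elasticsearch client (index name, IDs and routing value are placeholders, and exact call signatures vary by client/ES version - older clients also require a doc_type argument):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    doc = {"generic": {"id": "user-42"}, "created_at": "2014-05-01T12:00:00Z"}

    # Default behaviour: ES hashes the document _id to pick one of the 5
    # primary shards, then copies the document to that shard's replica.
    es.index(index="events-2014.05.01", id="abc123", body=doc)

    # With custom routing: every document sharing the same routing value lands
    # on the same shard - this is where clumping can appear if a handful of
    # routing values account for most of a day's volume.
    es.index(index="events-2014.05.01", id="abc124", body=doc, routing="user-42")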

Only you can determine whether that applies, based on your usage patterns - I'd suggest both some analysis of your existing data set to look for overly large concentrations and an analysis of your likely queries.
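
One quick way to do that analysis is a terms aggregation on the field you'd route by - a rough sketch below ("generic.id" is taken from your example; the index pattern is a placeholder, and the field may need a not_analyzed/keyword mapping to aggregate on cleanly):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    resp = es.search(
        index="events-2014.05.01",  # one day's index, or a wider pattern
        body={
            "size": 0,  # no hits needed, just the aggregation
            "aggs": {
                "docs_per_id": {
                    "terms": {"field": "generic.id", "size": 20}  # top 20 heaviest IDs
                }
            },
        },
    )

    # If the top few buckets account for a big share of the day's volume,
    # routing on this field will clump those documents onto 1 or 2 shards.
    for bucket in resp["aggregations"]["docs_per_id"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])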

A bigger question I have is the nature of your queries. You aren't showing the full query or the full schema (I see "generic.id" referenced, but not in the document schema, and your query appears to pull up every single document within a time range - is that correct?). Custom routing for indexing is most useful when your queries are bound by an exact match on the field used for routing. So, if I had an index with everyone's documents in it and my query pattern was to only retrieve a single user's documents in a single query, then custom routing by user ID would be very useful to improve query performance and reduce overall cluster load.
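
To make that concrete, a routed query looks roughly like this with the Python client (field, value and index name are placeholders) - the routing parameter sends the search to just the shard holding that user's documents instead of fanning out to all of them:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Only worthwhile if the query also filters on the exact field used for
    # routing; otherwise the routed shard won't hold all matching documents.
    resp = es.search(
        index="events-2014.05.01",
        routing="user-42",
        body={
            "query": {
                "bool": {
                    "must": [
                        {"term": {"generic.id": "user-42"}},
                        {"range": {"created_at": {"gte": "2014-05-01T00:00:00Z"}}},
                    ]
                }
            }
        },
    )
    print(resp["hits"]["total"])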

A second thing to consider is the overall balance of ingestion vs. queries. You are ingesting over 20M documents a day - how many queries are you executing per day? If that number is <<< the ingestion rate, you may want to think through whether you need a custom route at all. Also, if query performance is already good or great, you may not want to add the additional complexity.

Finally, on indexing by ingestion date vs. created_at: we've struggled with that one too, as we have some lag in receiving new documents. For now we've gone with storing by ingestion date, as it's easier to manage and not a big issue to query multiple indexes at a time, particularly if you auto-create aliases for 1 week, 2 weeks, 1 month, 2 months, etc. A bigger issue is what the distribution is - if you have documents that come in weeks or months later, perhaps you want to change to indexing by created_at, but that will require keeping those indexes online and open for quite some time.
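
As a rough sketch of what indexing by ingestion date looks like in practice (all names are placeholders): writers always target today's index, and readers cover a created_at window plus a small lag allowance by querying several daily indexes (or an alias maintained over them, as described further down):

    from datetime import datetime, timedelta

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    today = datetime.utcnow().date()

    # Writers always index into today's index, regardless of created_at.
    write_index = "events-%s" % today.strftime("%Y.%m.%d")
    es.index(index=write_index, body={"created_at": "2014-05-01T12:00:00Z"})

    # Readers wanting the last 14 days of created_at query the last 14 daily
    # indexes plus a couple extra to absorb late-arriving documents.
    days_back = 14 + 2
    indices = ",".join(
        "events-%s" % (today - timedelta(days=d)).strftime("%Y.%m.%d")
        for d in range(days_back)
    )
    resp = es.search(
        index=indices,
        ignore_unavailable=True,  # skip days that don't exist yet / any more
        body={"query": {"range": {"created_at": {"gte": "now-14d/d"}}}},
    )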

We currently use several document indexes per day, basically a "--" naming format. Practically this currently means 5 indexes per day. This allows us to be more selective about moving data in and out of the cluster. Not a recommendation for you, just something we've learned is useful to us.

2) Here's the great thing about ES - since you create a new index each day, you can adjust the number of shards per index as time goes on. While you cannot change it for an existing index, you are creating a new one every day, and you can base your decision on real production analytics. You certainly want to watch the number and be prepared to increase the shard count as/if your ingestion per day increases. It's not the simplest tradeoff - each one of those shards is a Lucene instance which potentially has multiple files, so more shards per index is not free, and the cost multiplies over time. Given your use case of 6 months, that's over 1,800 shards open across 3 nodes (182 days x 5 primaries and 5 replicas per day), with likely multiple files open per shard. We've found some level of overhead and impact on resource usage on our nodes as the total shard count in the cluster increased into these ranges. Your mileage may vary, but I'd be careful about increasing the number of shards per index when you are planning on keeping 182 indexes (6 months) at a time - that's quite a multiplier. I would definitely benchmark that ahead of time if you do make any changes to the default shard count.
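
Since the shard count is fixed once an index is created, the knob lives wherever you create each day's index - roughly like this with the Python client (names and settings are placeholders; an index template can achieve the same thing declaratively):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Whatever job creates tomorrow's index is where shards-per-index gets tuned.
    # Changing this only affects new daily indexes; existing ones keep their count.
    es.indices.create(
        index="events-2014.05.02",
        body={
            "settings": {
                "number_of_shards": 5,     # revisit as daily volume grows
                "number_of_replicas": 1,
            }
        },
    )

    # Keep an eye on the running total: ~182 daily indexes x (5 primaries +
    # 5 replicas) is over 1,800 shards open across the cluster.
    print(es.cluster.health()["active_shards"])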

3) There's no way anyone else can predict query performance for you ahead of time. It's based on overall cluster load, query complexity, query frequency, hardware, etc., and it's very specific to your environment. You are going to have to benchmark this. Personally, given that you've already loaded data, I'd use the ES snapshot and restore APIs to bring this data up in a test environment. Try it with the default of 1 replica and see how it goes. Adding replica shards is great for data redundancy and can help spread queries across the cluster, but it comes at a rather steep price - a 50% increase in storage, plus each additional replica shard brings additional ingestion cost to the node it runs on. It's great if you need the redundancy and can spare the capacity, not so great if you lack sufficient query volume to really take advantage of it.
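
A rough sketch of that snapshot/restore workflow with the Python client (repository name, location and index patterns are placeholders; the filesystem location has to be registered in both clusters' configs, and the target indexes must not already exist open on the test cluster):

    from elasticsearch import Elasticsearch

    prod = Elasticsearch(["http://prod-node:9200"])
    test = Elasticsearch(["http://test-node:9200"])

    # 1) Register a shared filesystem repository and snapshot the indexes.
    prod.snapshot.create_repository(
        repository="bench_backup",
        body={"type": "fs", "settings": {"location": "/mnt/es_backups"}},
    )
    prod.snapshot.create(
        repository="bench_backup",
        snapshot="pre_bench",
        body={"indices": "events-*"},
        wait_for_completion=True,
    )

    # 2) Restore into the test cluster (repository registered there as well).
    test.snapshot.restore(repository="bench_backup", snapshot="pre_bench")

    # 3) Unlike shard count, replica count can be changed on a live index,
    #    so you can benchmark 1 replica vs. 2 on the same data.
    test.indices.put_settings(
        index="events-*", body={"index": {"number_of_replicas": 2}}
    )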

4) Your question is incomplete (it ends with "we never"), so I can't answer it directly - however, a bigger question is why you are custom routing to begin with. Sure, it can have great performance benefits, but it's only useful if you can segment off a set of documents by the field you use to route. It's not entirely clear from your example data and partial query whether that's the case. Personally, I'd test it without any custom routing, then try the same with it and see if it has a significant impact.

5) Another question that will require some work on your part. You need to be tracking (at a minimum) JVM heap usage, overall memory and CPU usage, disk usage, and disk I/O activity over time. We do, with thresholds set to alert well ahead of seeing issues, so that we can add new members to the cluster early. Keep in mind that when you add a node to a cluster, ES is going to try to rebalance the cluster. Running production with only 3 nodes and a large initial document set can cause issues if you lose a node to problems (heap exhaustion, a JVM error, hardware failure, network failure, etc.). ES is going to go yellow and stay there for quite some time while it reshuffles.
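
All of those numbers are exposed by the nodes stats and cluster health APIs, so even a small cron-driven check can give you early warning - a rough sketch (thresholds are placeholders for whatever headroom you decide you need):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    HEAP_PCT_LIMIT = 75       # placeholder thresholds - tune to your own headroom
    DISK_FREE_GB_LIMIT = 500

    stats = es.nodes.stats(metric="jvm,os,fs")
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        free_gb = node["fs"]["total"]["available_in_bytes"] / 2**30
        if heap_pct > HEAP_PCT_LIMIT or free_gb < DISK_FREE_GB_LIMIT:
            print("ALERT %s: heap %d%%, %d GB free" % (node["name"], heap_pct, free_gb))

    # Cluster-level view: expect 'yellow' while replicas reshuffle after a
    # node is lost or added.
    print(es.cluster.health()["status"])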

Personally, for large document counts and high ingestion, I'd start adding nodes earlier. With more nodes in place, it's less of an issue if you take a node out for maintenance. Concerning your existing configuration, how did you get to 8 TB of HDD per node? Given an ingestion of 8 GB a day, that seems like overkill for 6 months of data. I'd strongly suspect, given the volume of data and number of indexes/shards, that you will want to move to more nodes, which will further reduce your storage requirement per node.

I'd definitely want to benchmark the maximum number of documents per node by looping through high-volume ingestion and loops of normal query frequency on a cluster with just 1 or 2 nodes, and see where it fails (either in performance, heap exhaustion, or some other issue). I'd then plan to keep the number of documents per node well below that number.
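
A skeleton of that kind of benchmark is sketched below (the document generator, query and index names are placeholders - the point is just to interleave bulk ingestion with your real query mix and record where latency or heap starts to degrade):

    import time

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://bench-node:9200"])

    def fake_docs(n, index):
        # Placeholder generator - substitute a sample of your real documents.
        # (Older ES versions also need a "_type" key in each action.)
        for i in range(n):
            yield {"_index": index,
                   "_source": {"generic": {"id": "user-%d" % (i % 1000)},
                               "created_at": "2014-05-01T12:00:00Z"}}

    for batch in range(1000):  # keep going until something gives
        helpers.bulk(es, fake_docs(10000, "bench-2014.05.01"))

        t0 = time.time()
        es.search(index="bench-*",
                  body={"query": {"term": {"generic.id": "user-7"}}})
        query_ms = (time.time() - t0) * 1000

        heap = max(n["jvm"]["mem"]["heap_used_percent"]
                   for n in es.nodes.stats(metric="jvm")["nodes"].values())
        print("batch %d: query %.0f ms, worst heap %d%%" % (batch, query_ms, heap))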

All that said, I'd go out on a limb and say I doubt you'll be all that happy with 4 billion plus documents on three 16 GB nodes. Even if it worked (again: test, test, test), losing one node is going to be a really big event. Personally, I like smaller nodes, but lots of them.

Other thoughts - we initially benchmarked on 3 Amazon EC2 m1.xlarge instances (4 cores, 15 GB of memory), which worked fine over several days of ingestion at 80M documents a day with a larger average document size than you appear to have. The biggest issue was the number of indexes and shards open (we were creating a couple of hundred new indexes per day, with maybe a couple of thousand more shards per day, and this was creating issues). We've since added a couple of new nodes with 30 GB of memory and 8 cores, and then added another 80M documents to test it out. Our current production approach is to prefer more moderately sized nodes as opposed to a few large ones.

UPDATE:

Regarding the benchmarking hardware: as stated above, it was benchmarked on 3 Amazon EC2 m1.xlarge virtual instances running Ubuntu 12.04 LTS and ES 1.1.0. We ran at about 80M documents a day (pulling data out of a MongoDB database we had previously used). Each instance had 1 TB of storage via Amazon EBS, with provisioned IOPS of, I believe, 1000 IOPS. We ran for about 4-5 days. We appear to have been a bit CPU constrained at 80M a day and believe that more nodes would have increased our ingestion rate. As the benchmark ran and the number of indexes and shards increased, we saw increasing memory pressure. We created a large number of indexes and shards (roughly 4-5 indexes per 1M documents, or about 400 indexes per day, with 5 primary shards and 1 replica shard per index).

Regarding the index aliases: we're creating rolling index aliases via a cron entry for 1 week back, 2 weeks back, etc., so that our application can just hit a known index alias and always run against a set time frame back from today. We're using the index aliases REST API to create and delete them:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html
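
For reference, the same thing through the Python client looks roughly like this (alias and index names are placeholders for however you name yours - the cron job just re-points each rolling alias at the right set of daily indexes):

    from datetime import datetime, timedelta

    from elasticsearch import Elasticsearch, NotFoundError

    es = Elasticsearch(["http://localhost:9200"])

    def rebuild_rolling_alias(alias, days_back):
        # Point `alias` at the last `days_back` daily indexes, dropping older ones.
        today = datetime.utcnow().date()
        wanted = set("events-%s" % (today - timedelta(days=d)).strftime("%Y.%m.%d")
                     for d in range(days_back))
        try:
            current = set(es.indices.get_alias(name=alias).keys())
        except NotFoundError:
            current = set()

        actions = [{"remove": {"index": idx, "alias": alias}}
                   for idx in current - wanted]
        actions += [{"add": {"index": idx, "alias": alias}}
                    for idx in wanted - current if es.indices.exists(index=idx)]
        if actions:
            es.indices.update_aliases(body={"actions": actions})

    # Run from cron once a day, shortly after the new daily index is created.
    rebuild_rolling_alias("events-last-1w", 7)
    rebuild_rolling_alias("events-last-2w", 14)
    rebuild_rolling_alias("events-last-1m", 31)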