Data load from HDFS to ES taking very long time Data load from HDFS to ES taking very long time elasticsearch elasticsearch

Data load from HDFS to ES taking very long time


You don't mention the methodology you're using to feed the data into ES so it's hard to see if you're using an ingestion pipeline or what technology to bridge the gap. Given that, I'll stick with generic advice on how to optimize ingestion into Elasticsearch.

Elastic has published some guidance for optimizing systems for ingestion, and there are three points that we've found do make a real difference:

  1. Turn Off Replicas: Set the number of replicas to zero while injesting the data to eliminate the need to copy the data while also injesting it. This is an index-level setting ("number_of_replicas")
  2. Don't Specify an ID: It isn't clear from your database schema if you are mapping across any identifiers, but if you can avoid specifying a document Id to Elastic and let it specify its own that significantly improves performance.
  3. Use Parallel Bulk Operators: Use the BulkAPI to push data into ES and feed it with multiple threads so it always has more than one Bulk request to work on server-side.

Finally, have you installed Kibana and monitored your nodes to know what they are limited by? In particular CPU or Memory?