
Elasticsearch Spark reading slow


It is not supposed to be this slow, and the answer can be found in the screenshot you shared:

The Stages: Succeeded/Total column in the Spark UI shows that only one task is running the read operation. That is surely not what you would expect; otherwise, what would be the point of having a whole cluster?

I faced the same problem, and it took me a while to figure out that Spark creates one task (partition) per shard of the Elasticsearch index.
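You can verify this yourself by checking how many partitions the read produces. Here is a minimal sketch, assuming the elasticsearch-spark (elasticsearch-hadoop) connector is on the classpath; the host and index name are placeholders for your own:

import org.apache.spark.sql.SparkSession

object EsReadPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-read-partitions")
      .getOrCreate()

    // Read the index through the elasticsearch-spark connector.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200") // placeholder: your Elasticsearch node(s)
      .load("index-name")                   // placeholder: your index

    // The connector creates one Spark partition (and therefore one task) per shard,
    // so an index with a single primary shard is read by a single task.
    println(s"Number of read partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}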

There we have our answer: to go faster, we need to parallelise the read, and the way to do that is to distribute the source index across multiple shards.

By default (since version 7.x), Elasticsearch creates an index with a single shard, but you can customise this when creating the index:

PUT /index-name
{
  "settings": {
    "index": {
      "number_of_shards": x,
      "number_of_replicas": xx
    }
  }
}

The number of shards can be higher than the number of Elasticsearch nodes; this is all transparent to Spark. If the index already exists, create a new index with more shards and then copy the data into it with the Elasticsearch Reindex API, for example:
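A minimal sketch of such a reindex call, where the source and destination index names are placeholders for your own:

POST /_reindex
{
  "source": {
    "index": "old-index-name"
  },
  "dest": {
    "index": "new-index-name"
  }
}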

I hope this solves your problem.