
Elasticsearch Spark reading slow


It is not supposed to be this slow, and the answer can be found in the screenshot you shared:

The Stages: Succeeded/Total column in the Spark UI shows that only one task is running the read operation. That is surely not what you would expect; otherwise, what would be the point of having a whole cluster?

I faced the same problem, and it took me a while to figure out that Spark creates one task (partition) per shard of the Elasticsearch index.
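You can verify this yourself by checking how many partitions the read produces. Here is a minimal sketch, assuming the elasticsearch-spark (elasticsearch-hadoop) connector is on the classpath; the host and index name are placeholders for your own:

import org.apache.spark.sql.SparkSession

object EsReadPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-read-partitions")
      .getOrCreate()

    // Read the index through the elasticsearch-spark connector.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200") // placeholder: your Elasticsearch node(s)
      .load("index-name")                   // placeholder: your index

    // The connector creates one Spark partition (and therefore one task) per shard,
    // so an index with a single primary shard is read by a single task.
    println(s"Number of read partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}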

There we have our answer: to go faster, we need to parallelise the read, and the way to do that is to distribute the source index across multiple shards.

By default (since version 7.x), Elasticsearch creates an index with a single shard, but you can customise this when creating the index:

PUT /index-name
{
  "settings": {
    "index": {
      "number_of_shards": x,
      "number_of_replicas": xx
    }
  }
}

The number of shards can be higher than the number of Elasticsearch nodes; this is all transparent to Spark. If the index already exists, create a new index with more shards and then copy the data into it with the Elasticsearch Reindex API, for example:
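A minimal sketch of such a reindex call, where the source and destination index names are placeholders for your own:

POST /_reindex
{
  "source": {
    "index": "old-index-name"
  },
  "dest": {
    "index": "new-index-name"
  }
}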

I hope this solves your problem.