
elasticsearch-spark connector size limit parameter is ignored in query


Some parameters are ignored in the query by design, such as from, size, fields, etc.

They are used internally inside the elasticsearch-spark connector.
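For illustration, here is a minimal sketch (the index name is a placeholder and sqlContext is assumed to be in scope, as in the snippet below) showing that a size embedded in es.query has no effect on how many documents come back:

val q = """{"size": 10, "query": {"match_all": {}}}"""
val df = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("es.query", q)
  .load("index_name/doc_type")
df.show(20) // can still print 20 rows: the "size" above is ignored by the connector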

Unfortunately, this list of unsupported parameters isn't documented. But if you want the effect of the size parameter, you can rely on predicate pushdown and use the DataFrame/Dataset limit method.

So you ought to use the Spark SQL DSL instead, e.g.:

val df = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .load("index_name/doc_type")
  .limit(10) // instead of size: 10

This query returns the first 10 documents of the match_all query that the connector uses by default.
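A quick way to sanity-check this, reusing the df above (same placeholder index):

df.printSchema() // fields discovered from the index mapping
df.show()        // materializes at most 10 rows from the default match_all scroll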

Note: The following isn't correct on any level.

This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).

When you query Elasticsearch directly, it also runs the query in parallel on all the index shards, without overriding the size parameter.


If I understand this correctly, you are executing a count operation, which does not return any documents. Do you expect it to return 1 because you specified size: 1? That's not happening, which is by design.

Edited to add: This is the definition of count() in elasticsearch-hadoop:

override def count(): Long = {
  val repo = new RestRepository(esCfg)
  try {
    return repo.count(true)
  } finally {
    repo.close()
  }
}

It does not take the query into account at all, but considers the entire ES index as the RDD input.
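If you need a count that respects a query, one workaround is to make Spark iterate the documents itself instead of delegating to the connector's index-wide count. A sketch, assuming the same placeholder index and a hypothetical es.query filter:

val matched = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("es.query", """{"query": {"term": {"status": "active"}}}""") // hypothetical filter
  .load("index_name/doc_type")

// Mapping before summing forces Spark to scroll through the actual hits,
// so only documents matching es.query are counted.
val n = matched.rdd.map(_ => 1L).fold(0L)(_ + _)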


This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).

In other words, if you want to control the size, do so through that setting as it will always take precedence.

Beware that this parameter applies per shard, so if you have 5 shards you might get up to five hits even when it is set to 1.
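A minimal sketch of wiring that setting through the reader options (index name is a placeholder):

val df = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("es.scroll.limit", "1") // per shard: with 5 shards this can still yield up to 5 hits
  .load("index_name/doc_type")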

See https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html