
Elasticsearch-hadoop & Elasticsearch-spark sql - Tracing of statements scan&scroll


  1. As far as I know, this is expected behavior. All sources I know of behave exactly the same way, and intuitively it makes sense: Spark SQL is designed for analytical queries, so it makes more sense to fetch the data, cache it, and process it locally. See also Does spark predicate pushdown work with JDBC?

  2. I don't think that conf.set("pushdown", "true") has any effect at all. If you want to configure connection-specific settings, they should be passed in an options map, as in the second case. Using the es. prefix should work as well.

  3. This is strange indeed. Martin Senne reported a similar issue with PostgreSQL, but I couldn't reproduce it.
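To illustrate point 2, here is a minimal sketch of passing connector settings per DataSource rather than through the global SparkConf. The index name "myindex/mytype" and the node address are placeholders, not from the original question:

```scala
// Sketch only - assumes Spark 1.x with elasticsearch-spark on the classpath.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext()
val sqlContext = new SQLContext(sc)

// Connection-specific settings go in the options map of the DataSource
// declaration, not in conf.set(...). Both the es.-prefixed connector
// settings and the DataSource-specific "pushdown" option are passed here.
val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")  // es.-prefixed connector setting (placeholder host)
  .option("pushdown", "true")            // DataSource-specific option
  .load("myindex/mytype")                // placeholder index/type
```

Settings passed this way apply only to this DataFrame, which is why a global conf.set("pushdown", "true") has no effect.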


After a discussion I had with Costin Leau on the Elasticsearch discussion group, he pointed out the following, which I ought to share with you:

There are a number of issues with your setup:

  1. You mention using Scala 2.11 but are actually using Scala 2.10. Note that if you want to pick your Scala version, you should use elasticsearch-spark; elasticsearch-hadoop provides binaries for Scala 2.10 only.

  2. The pushdown functionality is only available through the Spark DataSource API. If you are not using that type of declaration, the pushdown options are not passed to ES (that's how Spark works), so declaring pushdown elsewhere is irrelevant.

  3. Notice how all parameters in ES-Hadoop start with es. - the only exceptions are pushdown and location, which are Spark DataSource specific (following Spark conventions, as these are Spark-specific features in a dedicated DataSource).

  4. Using a temporary table does count as a DataSource; however, you need to specify pushdown there. If you don't, it gets activated by default, which is why you see no difference between your runs - you haven't changed any relevant parameter.

  5. Count and other aggregations are not pushed down by Spark. There might be something in the future, according to the Databricks team, but there isn't anything currently. For count, you can make a quick call using dataFrame.rdd.esCount, but that's an exceptional case.

  6. I'm not sure whether the Thrift server actually counts as a DataSource, since it loads data from Hive. You can double-check this by setting logging on the org.elasticsearch.hadoop.spark package to DEBUG. You should see whether the SQL gets translated to the DSL.
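Points 2, 4 and 5 above can be sketched together as follows. This is illustrative only: the index "myindex/mytype" and the field "airline" are placeholder names, and the exact availability of esCount (named in point 5) depends on the connector version:

```scala
// Sketch only - assumes Spark 1.x with elasticsearch-spark on the classpath.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext()
val sqlContext = new SQLContext(sc)

// Declaring the data through the Spark DataSource API - the only path on
// which pushdown can happen (point 2). Pushdown is on by default (point 4);
// the option is shown here only for clarity.
val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .load("myindex/mytype")              // placeholder index/type

// A filter like this can be translated into the Elasticsearch query DSL,
// so only matching documents are shipped back to Spark:
val filtered = df.filter(df("airline") === "KLM")

// Per point 5, the count itself is NOT pushed down - Spark counts the
// fetched rows locally:
val localCount = filtered.count()
```

To verify what is actually sent to Elasticsearch (point 6), set the org.elasticsearch.hadoop.spark logger to DEBUG in your log4j configuration and check the logs for the translated query DSL.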

I hope this helps!