Elasticsearch-hadoop & Elasticsearch-spark SQL - tracing of scan & scroll statements
As far as I know, this is expected behavior. All sources I know of behave exactly the same way, and intuitively it makes sense: Spark SQL is designed for analytical queries, so it makes more sense to fetch the data, cache it, and process it locally. See also Does spark predicate pushdown work with JDBC?
I don't think that

`conf.set("pushdown", "true")`

has any effect at all. If you want to configure connection-specific settings, they should be passed in an `OPTIONS` map, as in the second case. Using the `es` prefix should work as well.

This is strange indeed. Martin Senne reported a similar issue with PostgreSQL, but I couldn't reproduce it.
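For illustration, a minimal sketch of passing connector settings through the DataSource options map rather than the Spark configuration object (the host `localhost:9200`, the index `myindex/mytype`, and the pre-existing `SparkContext` named `sc` are all placeholder assumptions, for a Spark 1.4+ / elasticsearch-spark setup):

```scala
import org.apache.spark.sql.SQLContext

// sketch: sc is an existing SparkContext; host and index are placeholders
val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200") // connector setting, es.-prefixed
  .option("pushdown", "true")           // DataSource-specific option, no es. prefix
  .load("myindex/mytype")
```

Settings passed this way reach the DataSource, whereas values set on the `SparkConf` under a non-`es.` key are simply ignored by the connector.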
After a discussion I had with Costin Leau on the Elasticsearch discussion group, he pointed out the following, and I thought I ought to share it with you:
There are a number of issues with your setup:

- You mention using Scala 2.11 but are using Scala 2.10. Note that if you want to pick your Scala version, `elasticsearch-spark` should be used; `elasticsearch-hadoop` provides binaries for Scala 2.10 only.
- The pushdown functionality is only available through the Spark DataSource. If you are not using this type of declaration, the `pushdown` option is not passed to ES (that's how Spark works), hence declaring `pushdown` there is irrelevant.
- Notice how all params in ES-Hadoop start with `es.` - the only exceptions are `pushdown` and `location`, which are Spark DataSource specific (following Spark conventions, as these are Spark-specific features in a dedicated DS).
- Using a temporary table does count as a DataSource; however, you need to use `pushdown` there. If you don't, it gets activated by default, hence why you see no difference between your runs - you haven't changed any relevant param.
- Count and other aggregations are not pushed down by Spark. There might be something in the future, according to the Databricks team, but there isn't anything currently. For count, you can do a quick call by using `dataFrame.rdd.esCount`, but that's an exceptional case.
- I'm not sure whether the Thrift server actually counts as a DataSource, since it loads data from Hive. You can double-check this by setting logging on the `org.elasticsearch.hadoop.spark` package to DEBUG. You should see whether the SQL does get translated to the DSL.
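Putting the points above together, a hedged sketch of a temporary-table DataSource declaration with `pushdown` enabled (Spark 1.x SQL syntax; `sqlContext` is assumed to be an existing `SQLContext`, and the table name, index `myindex/mytype`, and the `duration` field are placeholders):

```scala
// Declaring ES through the DataSource API; pushdown is a DataSource option
sqlContext.sql(
  """CREATE TEMPORARY TABLE trips
    |USING org.elasticsearch.spark.sql
    |OPTIONS (resource 'myindex/mytype',
    |         pushdown 'true')""".stripMargin)

// Filters in a query like this one are candidates for translation
// into the Elasticsearch query DSL by the connector
val filtered = sqlContext.sql("SELECT * FROM trips WHERE duration > 10")
```

With DEBUG logging enabled as described above, you can then check whether the `WHERE` clause actually appears translated into the query DSL.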
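To enable the DEBUG logging mentioned above, a `log4j.properties` fragment along these lines should suffice (assuming the default log4j setup that Spark ships with):

```
log4j.logger.org.elasticsearch.hadoop.spark=DEBUG
```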
I hope this helps!