How to save dataframe to Elasticsearch in PySpark?


tl;dr Use pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.2.0 and use format("es") to reference the connector.
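As a minimal sketch of the tl;dr, assuming the shell was started with the `--packages` option above and that an Elasticsearch node is reachable at `localhost:9200` (both assumptions), a write could look like this:

```python
# Sketch: write a DataFrame to Elasticsearch via the es-hadoop connector.
# Assumes pyspark was launched with:
#   pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.2.0
# and that Elasticsearch listens on localhost:9200 (both are assumptions).

# Connector options; es.resource names the target index ("people" is a
# hypothetical example).
es_conf = {
    "es.nodes": "localhost",
    "es.port": "9200",
    "es.resource": "people",
}

def save_to_es(df):
    # format("es") is the short alias for org.elasticsearch.spark.sql
    df.write.format("es").options(**es_conf).mode("append").save()
```

Passing the options as a dict keeps the connection settings in one place, so the same configuration can be reused for reads.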


Quoting Installation from the official documentation of the Elasticsearch for Apache Hadoop product:

Just like other libraries, elasticsearch-hadoop needs to be available in Spark’s classpath.

And later in Supported Spark SQL versions:

elasticsearch-hadoop supports both version Spark SQL 1.3-1.6 and Spark SQL 2.0 through two different jars: elasticsearch-spark-1.x-<version>.jar and elasticsearch-hadoop-<version>.jar

elasticsearch-spark-2.0-<version>.jar supports Spark SQL 2.0

That looks like an inconsistency in the documentation (it references two differently named jar files), but it does mean that you have to put the proper jar file on the CLASSPATH of your Spark application.

And later in the same document:

Spark SQL support is available under org.elasticsearch.spark.sql package.

That simply says that the format (as in df.write.format('org.elasticsearch.spark.sql')) is correct.

Further down in the document you can find that you can even use the alias df.write.format("es") (!)
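To make the equivalence concrete, here is a small sketch: the fully qualified name and the "es" alias both resolve to the same data source, provided the connector jar is on the classpath. The index name "demo" is a hypothetical example.

```python
# Sketch: the long data source name and its alias are interchangeable.

LONG_FORMAT = "org.elasticsearch.spark.sql"  # fully qualified data source
ALIAS = "es"                                 # short alias for the same source

def write_with(df, fmt=ALIAS):
    # Either fmt value selects the elasticsearch-hadoop data source;
    # es.resource names the target index ("demo" is a hypothetical example).
    df.write.format(fmt).option("es.resource", "demo").save()
```

Which form you pick is purely a matter of taste; the alias is just shorter.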

I found the Apache Spark section of the project's repository on GitHub more readable and up to date.


Update: The current ES-hadoop package as of June 2020 is 7.7.1, so I used pyspark --packages org.elasticsearch:elasticsearch-hadoop:7.7.1 instead.