Writing an RDD from Spark to Elasticsearch fails
It looks like a problem with the PySpark computations themselves, not necessarily with the Elasticsearch saving step. Check that your RDDs are OK by:
- Performing `count()` on `rdd1` (to "materialize" the results)
- Performing `count()` on `rdd2`
If the counts are OK, try caching the results before saving them to ES:

```python
res2.cache()
res2.count()  # to fill the cache
res2.saveAsNewAPIHadoopFile(...)
```
If the problem still appears, look at the stderr and stdout of the dead executors (you can find them on the Executors tab in the Spark UI).
I also noticed the very small batch size in your `es_write_conf`; try increasing it to 500 or 1000 entries to get better write performance.
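For reference, a sketch of what such a configuration dict might look like (the host, port, and index/type names are placeholders; `es.batch.size.entries` and `es.batch.size.bytes` are the elasticsearch-hadoop settings that control bulk-request batching):

```python
# Hypothetical es_write_conf passed to saveAsNewAPIHadoopFile;
# adjust nodes/resource to your cluster and index.
es_write_conf = {
    "es.nodes": "localhost",            # placeholder ES host
    "es.port": "9200",                  # placeholder ES port
    "es.resource": "my_index/my_type",  # placeholder index/type
    "es.batch.size.entries": "1000",    # documents per bulk request
    "es.batch.size.bytes": "1mb",       # size cap per bulk request
}
print(es_write_conf["es.batch.size.entries"])
```

Larger batches mean fewer bulk HTTP requests to Elasticsearch, which usually improves throughput, at the cost of slightly more memory per request.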