Writing an RDD from Spark to Elasticsearch fails
It looks like a problem with the PySpark computations themselves, not necessarily with the Elasticsearch saving step. Check that your RDDs are OK by:
- Performing `count()` on `rdd1` (to "materialize" the results)
- Performing `count()` on `rdd2`
If the counts are OK, try caching the results before saving them to ES:

```python
res2.cache()
res2.count()  # to fill the cache
res2.saveAsNewAPIHadoopFile(...)
```
If the problem still appears, look at the stderr and stdout of the dead executors (you can find them on the Executors tab in the Spark UI).
I also noticed the very small batch size in your `es_write_conf`; try increasing it to 500 or 1000 entries to get better write performance.
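For reference, a sketch of what such a configuration dict might look like (the host, port, and index/type names are placeholders; `es.batch.size.entries` and `es.batch.size.bytes` are the elasticsearch-hadoop settings that control bulk-request batching):

```python
# Hypothetical es_write_conf passed to saveAsNewAPIHadoopFile;
# adjust nodes/resource to your cluster and index.
es_write_conf = {
    "es.nodes": "localhost",            # placeholder ES host
    "es.port": "9200",                  # placeholder ES port
    "es.resource": "my_index/my_type",  # placeholder index/type
    "es.batch.size.entries": "1000",    # documents per bulk request
    "es.batch.size.bytes": "1mb",       # size cap per bulk request
}
print(es_write_conf["es.batch.size.entries"])
```

Larger batches mean fewer bulk HTTP requests to Elasticsearch, which usually improves throughput, at the cost of slightly more memory per request.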