Elasticsearch + Apache Spark performance


I figured out what was going on. I was trying to manipulate the DataFrame schema because some of my fields contain a dot, e.g. user.firstname. This seems to cause a problem in the collect phase of Spark. To resolve it, I re-indexed my data so that the field names use an underscore instead of a dot, e.g. user_firstname.
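For illustration, here is a minimal sketch of the renaming step, assuming each document is available as a plain Python dict before it is re-indexed. The helper name `flatten_keys` and the sample document are hypothetical, not part of the original setup; the sketch only shows how dotted (or nested) field names can be turned into underscore-separated ones.

```python
def flatten_keys(doc, parent="", sep="_"):
    """Return a flat copy of `doc` whose keys contain no dots.

    Dots inside a key are replaced by `sep`, and nested objects are
    folded into their parent key, so both {"user.firstname": ...} and
    {"user": {"firstname": ...}} end up as "user_firstname".
    """
    flat = {}
    for key, value in doc.items():
        name = key.replace(".", sep)
        if parent:
            name = parent + sep + name
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten_keys(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical sample document, used only to show the transformation.
sample = {"user.firstname": "Ada", "user": {"lastname": "Lovelace"}, "age": 36}
renamed = flatten_keys(sample)
```

Each renamed document would then be written to a new index; how that write is done (bulk API, connector, etc.) is left out here because it depends on the setup.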


I'm afraid you can't perform a group by over 1.4 TB with only 120 GB of total RAM and achieve good performance. The DataFrame will try to load all the data into memory/disk, and only then will it perform the group by. I don't think the Spark/ES connector currently translates SQL syntax into the ES query language.
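If the grouping can be expressed as an Elasticsearch aggregation, one possible workaround is to let Elasticsearch compute the buckets itself instead of shipping all the data to Spark. The sketch below only builds the request body for a terms aggregation; the function name, the aggregation name `group_by`, and the field `user_firstname` are assumptions made for the example, and the field has to be aggregatable (for example a keyword field).

```python
import json


def terms_aggregation_body(field, buckets=100):
    """Build a search body that groups documents by `field`."""
    return {
        "size": 0,  # return no hits, only the aggregation buckets
        "aggs": {
            "group_by": {  # hypothetical aggregation name
                "terms": {"field": field, "size": buckets}
            }
        },
    }


# Hypothetical field name, chosen to match the renamed field above.
body = terms_aggregation_body("user_firstname")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the index's `_search` endpoint. Whether this is enough depends on the query: it covers simple counts per key, not arbitrary SQL.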