Apache Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs


All your cases use

--executor-cores 1

It is best practice to go above 1, but don't go above 5; this comes from our experience and from the Spark developers' recommendation.

E.g. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/:

A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number

I can't find the reference now where it was recommended to go above 1 core per executor, but the idea is that running multiple tasks in the same executor lets them share some common memory regions, so it actually saves memory.

Start with --executor-cores 2 and double --executor-memory (because --executor-cores also determines how many tasks one executor will run concurrently), and see what it does for you. Your environment is compact in terms of available memory, so going to 3 or 4 cores will give you even better memory utilization (see the sketch below).
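
If it helps, here is a minimal sketch of the same idea expressed through SparkConf instead of spark-submit flags (my illustration, not your original setup): spark.executor.cores and spark.executor.memory correspond to --executor-cores and --executor-memory, and "8g" is just a placeholder for whatever your doubled memory figure works out to.

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent of: --executor-cores 2 --executor-memory 8g
// ("8g" is a placeholder: double whatever you were using with --executor-cores 1)
val conf = new SparkConf()
  .setAppName("executor-tuning-sketch")
  .set("spark.executor.cores", "2")    // also the number of concurrent tasks per executor
  .set("spark.executor.memory", "8g")  // scale memory up together with cores
val sc = new SparkContext(conf)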

We use Spark 1.5 and stopped using --executor-cores 1 quite some time ago, as it was giving GC problems; it also looks like a Spark bug, because just giving more memory wasn't helping as much as switching to more tasks per container. I guess tasks in the same executor peak their memory consumption at different times, so you don't waste memory or have to overprovision it just to make things work.
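
If you want to check whether you're hitting the same GC pressure, one option (my addition, not something from your question) is to turn on GC logging for the executors through the standard spark.executor.extraJavaOptions setting; the output shows up in each executor's stderr log.

import org.apache.spark.SparkConf

// Diagnostic only: standard JVM GC-logging flags passed to every executor
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")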

Another benefit is that Spark's shared variables (accumulators and broadcast variables) will have just one copy per executor, not one per task, so switching to multiple tasks per executor is a direct memory saving right there. Even if you don't use Spark's shared variables explicitly, Spark very likely creates them internally anyway. For example, if you join two tables through Spark SQL, Spark's CBO may decide to broadcast the smaller table (or the smaller dataframe) to make the join run faster.

http://spark.apache.org/docs/latest/programming-guide.html#shared-variables
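
To make the "one copy per executor" point concrete, here is a minimal sketch of both kinds of shared variables using the Spark 1.x Scala API; the lookup map and the "bad records" counter are made-up placeholders, not anything from your actual job.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-variables-sketch"))

// Broadcast variable: shipped once per executor and shared by all of its tasks,
// instead of being serialized into every task closure.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator (1.x API; replaced by sc.longAccumulator in 2.x): per-task
// updates are merged back on the driver.
val badRecords = sc.accumulator(0, "bad records")

val data = sc.parallelize(Seq("a", "b", "c", "d"))

// Read the broadcast inside a transformation.
val mapped = data.map(k => lookup.value.getOrElse(k, 0)).collect()

// Update the accumulator inside an action (accumulator updates are only
// guaranteed to be applied once when used in actions).
data.foreach(k => if (!lookup.value.contains(k)) badRecords += 1)
println(badRecords.value)

The same broadcast mechanism is what Spark SQL relies on when it decides to broadcast the smaller side of a join, as mentioned above.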