
Apache Spark: The number of cores vs. the number of executors


To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
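
If it helps to see that arithmetic spelled out, here is a minimal Python sketch of the capacity calculation (the numbers come from the example above; the variable names are mine, not real configuration keys):

    # Sketch of the NodeManager capacity arithmetic from the example above.
    node_cores = 16
    node_memory_gb = 64
    reserved_cores = 1          # left for the OS and Hadoop daemons
    reserved_memory_gb = 1

    cpu_vcores = node_cores - reserved_cores                    # 15
    memory_mb = (node_memory_gb - reserved_memory_gb) * 1024    # 63 * 1024 = 64512

    print(cpu_vcores, memory_mb)   # values for yarn.nodemanager.resource.*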

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

  • 63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
  • The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
  • 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63GB / 3 executors per node) = 21GB; subtracting ~7% for the memory overhead gives 21 * 0.07 = 1.47, and 21 – 1.47 ≈ 19.
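
Here is the same derivation as a small Python sketch (a sketch of the arithmetic only, not an official formula; the variable names are mine):

    # Rough sketch of the executor sizing above. The 0.07 is the ~7% memory
    # overhead figure used in the derivation.
    nodes = 6
    usable_cores_per_node = 15      # after reserving 1 core per node
    usable_memory_per_node_gb = 63  # after reserving 1 GB per node

    cores_per_executor = 5          # keep at or below ~5 for HDFS throughput
    executors_per_node = usable_cores_per_node // cores_per_executor          # 3

    memory_per_executor_gb = usable_memory_per_node_gb / executors_per_node   # 21
    overhead_gb = 0.07 * memory_per_executor_gb                               # 1.47
    executor_memory_gb = int(memory_per_executor_gb - overhead_gb)            # 19

    num_executors = nodes * executors_per_node - 1   # leave one slot for the AM -> 17
    print(num_executors, cores_per_executor, executor_memory_gb)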

The explanation is given in Cloudera's blog post, How-to: Tune Your Apache Spark Jobs (Part 2).


Since you run your Spark app on top of HDFS, according to Sandy Ryza:

I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

So I believe that your first configuration is slower than the third one because of bad HDFS I/O throughput.


Short answer: I think tgbaggio is right. You hit HDFS throughput limits on your executors.

I think the answer here may be a little simpler than some of the other recommendations.

The clue for me is in the cluster network graph. For run 1 the utilization is steady at ~50 MB/s. For run 3 the steady utilization is doubled, at around 100 MB/s.

From the Cloudera blog post shared by DzOrd, you can see this important quote:

I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

So, let's do a few calculations to see what performance we expect if that is true.


Run 1: 19 GB, 7 cores, 3 executors

  • 3 executors x 7 threads = 21 threads
  • with 7 cores per executor, we expect limited IO to HDFS (maxes out at ~5 cores)
  • effective throughput ~= 3 executors x 5 threads = 15 threads

Run 3: 4 GB, 2 cores, 12 executors

  • 12 executors x 2 threads = 24 threads
  • 2 cores per executor, so HDFS throughput is ok
  • effective throughput ~= 12 executors x 2 threads = 24 threads

If the job is 100% limited by concurrency (the number of threads), we would expect runtime to be perfectly inversely correlated with the number of threads.

ratio_num_threads = nthread_job1 / nthread_job3 = 15/24 = 0.625
inv_ratio_runtime = 1/(duration_job1 / duration_job3) = 1/(50/31) = 31/50 = 0.62

So ratio_num_threads ~= inv_ratio_runtime, and it looks like we are network limited.
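
The same comparison as a small Python sketch, using the 50 and 31 durations from above and capping useful HDFS threads at ~5 per executor (the cap is the rule of thumb from the quote, not a hard limit):

    # Effective threads per run, capped at ~5 useful HDFS threads per executor.
    threads_run1 = 3 * min(7, 5)     # 15
    threads_run3 = 12 * min(2, 5)    # 24

    ratio_num_threads = threads_run1 / threads_run3   # 0.625
    inv_ratio_runtime = 1 / (50 / 31)                  # ~0.62

    print(ratio_num_threads, inv_ratio_runtime)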

This same effect explains the difference between Run 1 and Run 2.


Run 2: 19 GB, 4 cores, 3 executors

  • 3 executors x 4 threads = 12 threads
  • with 4 cores per executor, ok IO to HDFS
  • effective throughput ~= 3 executors x 4 threads = 12 threads

Comparing the number of effective threads and the runtime:

ratio_num_threads = nthread_job2 / nthread_job1 = 12/15 = 0.8
inv_ratio_runtime = 1/(duration_job2 / duration_job1) = 1/(55/50) = 50/55 = 0.91

It's not as perfect as the last comparison, but we still see a similar drop in performance when we lose threads.
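
The Run 2 vs. Run 1 numbers, sketched the same way (durations 55 and 50 from above):

    threads_run2 = 3 * min(4, 5)     # 12
    threads_run1 = 3 * min(7, 5)     # 15

    ratio_num_threads = threads_run2 / threads_run1   # 0.8
    inv_ratio_runtime = 1 / (55 / 50)                  # ~0.91

    print(ratio_num_threads, inv_ratio_runtime)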

Now for the last bit: why do we get better performance with more threads, especially more threads than the number of CPUs?

A good explanation of the difference between parallelism (what we get by dividing up data onto multiple CPUs) and concurrency (what we get when we use multiple threads to do work on a single CPU) is provided in this great post by Rob Pike: Concurrency is not parallelism.

The short explanation is that if a Spark job is interacting with a file system or network, the CPU spends a lot of time waiting on communication with those interfaces and not a lot of time actually "doing work". By giving those CPUs more than one task to work on at a time, they spend less time waiting and more time working, and you see better performance.
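
To see the same effect outside of Spark, here is a toy Python sketch (not Spark code; time.sleep stands in for waiting on HDFS or the network). Running more threads than you have CPUs still shortens the wall-clock time, because the threads spend most of their time waiting rather than computing:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def io_bound_task(_):
        time.sleep(0.1)   # simulated wait on a file system / network call

    for workers in (1, 4, 16):
        start = time.time()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(io_bound_task, range(32)))
        print(f"{workers:2d} threads -> {time.time() - start:.2f}s")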