
Why does caching small Spark RDDs take a big memory allocation in YARN?


Perhaps you can try to repartition(n) your RDD to fewer partitions (n < 100) before caching. A ~30GB RDD would probably fit into the storage memory of ten 10GB executors. A good overview of Spark memory management can be found here. This way, only the executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
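To make that concrete, here is a minimal Scala sketch, assuming a text input; the path, the target of 10 partitions and the storage level are illustrative assumptions, not values from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cache-small-rdd")
  // Let YARN reclaim executors that sit idle and hold no cached blocks.
  // Dynamic allocation typically also needs the external shuffle service
  // or shuffle tracking enabled.
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()

val rdd = spark.sparkContext.textFile("hdfs:///data/input")  // hypothetical path

// Shrink the number of partitions before caching, so the cached blocks are
// concentrated on a handful of executors instead of spread across all of them.
val compacted = rdd.repartition(10)
compacted.persist(StorageLevel.MEMORY_ONLY)
compacted.count()  // materialize the cache
```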

Q: Is it possible to keep only the memory needed for the cached RDDs on each executor and release the rest, instead of always holding the initially set memory spark.executor.memory=10G?

When Spark runs on YARN, YARN allocates containers of the size the application requests -- at least spark.executor.memory + spark.executor.memoryOverhead, and possibly more in the case of PySpark -- for all of the executors. How much memory Spark actually uses inside a container becomes irrelevant, since the resources allocated to a container are considered off-limits to other YARN applications.
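A rough back-of-the-envelope sketch of that request size, assuming the default overhead factor of 10% (minimum 384 MiB); the 10G figure mirrors the question's spark.executor.memory=10G:

```scala
// Approximate per-executor container request to YARN.
// Default overhead is max(384 MiB, 10% of spark.executor.memory);
// spark.executor.pyspark.memory, if set, is added on top for PySpark.
val executorMemoryGiB  = 10.0
val memoryOverheadGiB  = math.max(0.375, 0.10 * executorMemoryGiB)  // 1 GiB here
val containerRequestGiB = executorMemoryGiB + memoryOverheadGiB     // ~11 GiB

println(f"YARN container request per executor: ~$containerRequestGiB%.1f GiB")
```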


Spark assumes that your data is equally distributed across all the executors and tasks. That's why memory is configured per executor and shared equally by its tasks. So to make Spark consume less memory, your data has to be evenly distributed:

  • If you are reading from Parquet files or CSVs, make sure they have similar sizes. Running repartition() causes a shuffle, which, if the data is heavily skewed, may cause other problems if the executors don't have enough resources
  • Caching won't help release memory on the executors, because it doesn't redistribute the data
  • Look at the "Event Timeline" on the Stages tab and check how big the green bars are. Their length normally reflects the data distribution, so it's a way to see how much data is loaded (proportionally) on every task and how much work each one is doing. Since all tasks get the same memory, you can see graphically whether resources are wasted (mostly tiny bars and a few big ones). An example of wasted resources can be seen in the screenshot below; a programmatic way to check the same thing is sketched after it

[Screenshot: Spark UI stage Event Timeline showing mostly tiny task bars alongside a few large ones]
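If you prefer checking skew programmatically rather than eyeballing the timeline, a minimal sketch (assuming an existing RDD named rdd) that counts records per partition:

```scala
// Count records in each partition and print the largest ones.
// Heavily uneven counts indicate the skew visible in the Event Timeline.
val perPartitionCounts = rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()

perPartitionCounts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx -> $n records")
}
```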

There are different ways to create evenly distributed files for your process. I mention some possibilities (a combined sketch follows the list), but there are certainly more:

  • Using Hive and the DISTRIBUTE BY clause: you need to distribute by a field that is well balanced in order to create as many files (and of the proper size) as expected
  • If the process creating those files is a Spark job reading from a database, try to open as many connections as the number of files you need, and use a proper field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties
  • repartition() may work, but also check whether coalesce() makes sense in your process, or in the previous one that generates the files you are reading from
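As a rough illustration of these options, a hedged Scala sketch follows; the table names, JDBC URL, column names, bounds and output path are all placeholders, not details from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("even-files")
  .enableHiveSupport()
  .getOrCreate()

// 1) Hive-style DISTRIBUTE BY: spread rows across output files on a
//    reasonably balanced key ("customer_id" is a hypothetical column).
spark.sql(
  """
    |INSERT OVERWRITE TABLE target_table
    |SELECT * FROM source_table
    |DISTRIBUTE BY customer_id
  """.stripMargin)

// 2) Partitioned JDBC read: Spark opens numPartitions connections and splits
//    the range [lowerBound, upperBound] of partitionColumn across them.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.orders")
  .option("partitionColumn", "order_id")  // must be numeric, date or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "100")
  .load()

// 3) coalesce() reduces the partition count without a full shuffle, which is
//    often enough when you just want fewer, larger output files.
jdbcDF.coalesce(20).write.mode("overwrite").parquet("hdfs:///data/orders_even")
```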