Apache Spark Handling Skewed Data


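Yes, you should salt the keys of the larger table by appending a random value to each key. Then replicate the smaller table once per salt value (in effect a cartesian join of the smaller table with the set of salt values) and join it to the salted table.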

Here are a couple of suggestions:

Tresata skew join for RDDs: https://github.com/tresata/spark-skewjoin

Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

The tresata library looks like this:

    import com.tresata.spark.skewjoin.Dsl._  // for the implicits

    // skewJoin() method pulled in by the implicits
    rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1))
      .sortByKey(true)
      .collect
      .toList
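
If you would rather not pull in a library, the salting approach from the top of this answer can be written by hand with the plain RDD API. The following is only a minimal sketch, not what the Tresata library does internally: the names SaltedJoin, saltedJoin and numSalts are made up for illustration, and it assumes the second RDD is small enough to be replicated numSalts times.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

// Hypothetical helper, not part of Spark or spark-skewjoin.
object SaltedJoin {

  def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
      large: RDD[(K, V)],   // the skewed, larger side
      small: RDD[(K, W)],   // the smaller side, which gets replicated
      numSalts: Int         // how many ways each hot key is split
  ): RDD[(K, (V, W))] = {

    // 1. Salt the larger side: pair each key with a random value in [0, numSalts),
    //    so rows sharing one hot key are spread over up to numSalts distinct keys.
    val saltedLarge: RDD[((K, Int), V)] =
      large.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }

    // 2. Replicate the smaller side: emit each row once for every salt value,
    //    so every salted key on the larger side still finds its match.
    val replicatedSmall: RDD[((K, Int), W)] =
      small.flatMap { case (k, w) => (0 until numSalts).map(salt => ((k, salt), w)) }

    // 3. Join on the (key, salt) pair, then drop the salt from the result.
    saltedLarge.join(replicatedSmall).map { case ((k, _), vw) => (k, vw) }
  }
}
```

You would call it as SaltedJoin.saltedJoin(rdd1, rdd2, numSalts = 16). A larger numSalts spreads a hot key over more partitions, but it also multiplies the size of the smaller RDD by the same factor, so it is a trade-off. Because the salt is random, a recomputed task can assign different salts than the first attempt did. The join result is the same either way, since every salt value has a matching replica on the smaller side.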