Apache Spark Handling Skewed Data
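Yes, you should salt the keys of the larger table (append a random value to each key) and then replicate the smaller table once per salt value, effectively cartesian-joining it with the set of salts, so that it can be joined to the salted larger table.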
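To make the salting idea concrete, here is a minimal sketch with plain pair RDDs. It is not taken from either library linked in this answer. The names `saltedJoin` and `numSalts` are made up for illustration, and the number of salts is a tuning knob you would have to pick for your own data.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag
import scala.util.Random

object SaltedJoinSketch {
  // Hypothetical helper: numSalts is an assumed tuning parameter. More salts
  // spread a hot key over more partitions, but the small side is replicated
  // numSalts times.
  def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
      large: RDD[(K, V)],
      small: RDD[(K, W)],
      numSalts: Int): RDD[(K, (V, W))] = {
    // Large side: attach a random salt in [0, numSalts) to every key.
    val saltedLarge = large.map { case (k, v) =>
      ((k, Random.nextInt(numSalts)), v)
    }
    // Small side: emit one copy of each row per possible salt value.
    val replicatedSmall = small.flatMap { case (k, w) =>
      (0 until numSalts).map(salt => ((k, salt), w))
    }
    // Join on (key, salt), then drop the salt again.
    saltedLarge.join(replicatedSmall).map { case ((k, _), vw) => (k, vw) }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("salted-join-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // One heavily skewed key ("hot") and one rare key ("cold").
    val large = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 2)))
    val small = sc.parallelize(Seq(("hot", "a"), ("cold", "b")))

    val joined = saltedJoin(large, small, numSalts = 8)
    // Each large row matches exactly one replica of its small row, so the
    // row count is the same as for a plain large.join(small).
    println(joined.count())

    spark.stop()
  }
}
```

The trade-off is that the smaller table grows by a factor of `numSalts`, so this only pays off when the skew on the large side is the real bottleneck.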
Here are a couple of suggestions:
Tresata skew join RDD: https://github.com/tresata/spark-skewjoin
Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Usage of the Tresata library looks like this:
import com.tresata.spark.skewjoin.Dsl._ // for the implicits

// skewJoin() method pulled in by the implicits
rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1)).sortByKey(true).collect.toList