Apache Spark Handling Skewed Data

scala hadoop apache-spark spark-dataframe

Yes, you should salt the keys of the larger table (append a random value to each key), then replicate the smaller table once per salt value (in effect a cartesian join against the set of salts) and join it to the salted table.
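To make the salting idea concrete, here is a minimal sketch on plain pair RDDs. The `SaltedJoin` object, the `saltedJoin` helper, and the `numSalts` parameter are names made up for illustration; they are not part of Spark or of any library mentioned in this answer. It assumes the right-hand RDD is small enough to be replicated `numSalts` times.

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Hypothetical helper for illustration; not part of Spark or spark-skewjoin.
object SaltedJoin {
  // numSalts is an assumed tuning knob: how many sub-keys each key is split into.
  def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
      large: RDD[(K, V)],
      small: RDD[(K, W)],
      numSalts: Int): RDD[(K, (V, W))] = {
    require(numSalts > 0, "numSalts must be positive")

    // Larger side: attach a random salt so the rows of one hot key
    // are spread over numSalts distinct (key, salt) pairs.
    val saltedLarge: RDD[((K, Int), V)] =
      large.mapPartitions { iter =>
        val rnd = new Random()
        iter.map { case (k, v) => ((k, rnd.nextInt(numSalts)), v) }
      }

    // Smaller side: emit every row once per salt value, so each
    // salted key on the larger side still finds its match.
    val replicatedSmall: RDD[((K, Int), W)] =
      small.flatMap { case (k, w) =>
        (0 until numSalts).map(salt => ((k, salt), w))
      }

    // Join on (key, salt), then drop the salt again.
    saltedLarge.join(replicatedSmall).map { case ((k, _), vw) => (k, vw) }
  }
}
```

A call such as `SaltedJoin.saltedJoin(largeRdd, smallRdd, numSalts = 16)` should return the same pairs as `largeRdd.join(smallRdd)`, with the work for a hot key split across up to 16 tasks. The cost is that the smaller side grows by a factor of `numSalts`, so the value is a trade-off rather than something to set as high as possible.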

Here are a couple of suggestions:

Tresata skew join for RDDs: https://github.com/tresata/spark-skewjoin
Python skew join: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

The tresata library looks like this:

    import com.tresata.spark.skewjoin.Dsl._  // for the implicits

    // skewJoin() method pulled in by the implicits
    rdd1.skewJoin(rdd2, defaultPartitioner(rdd1, rdd2), DefaultSkewReplication(1))
      .sortByKey(true)
      .collect
      .toList
