
Spark: Custom key compare method for reduceByKey


You can't override the comparison used by reduceByKey, because the operation relies on each key's hash to shuffle records with the same key onto the same executor across your cluster; a custom compare method would break that partitioning. What you can do is change the key itself (and be aware that, depending on the transformations/actions you use, this is likely to re-shuffle the data).

There is a nifty method on RDD for exactly this, called keyBy, so you can do something like this:

val data: RDD[MyClass] = ...  // Same code you have now.
val byId2 = data.keyBy(_.id2) // Assuming your ids are Longs, this produces an RDD[(Long, MyClass)]
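
To round that out, here is a minimal self-contained sketch of the keyBy approach; the fields of MyClass and the merge logic (summing a count) are assumptions for illustration, not from the question:

import org.apache.spark.rdd.RDD

case class MyClass(id2: Long, count: Int)

// Key each record by id2, then merge records that share an id2.
def reduceById2(data: RDD[MyClass]): RDD[(Long, MyClass)] =
  data
    .keyBy(_.id2) // RDD[(Long, MyClass)]
    .reduceByKey((a, b) => a.copy(count = a.count + b.count))

Since keyBy produces a pair RDD, all of the PairRDDFunctions (reduceByKey, groupByKey, join, ...) become available on the result.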


If you are able to alter your class: reduceByKey compares keys using equals and hashCode. So if you make sure both are defined, and defined consistently with each other, the correct comparisons will be used.
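
For instance, here is a sketch of a plain class (a case class would auto-generate field-by-field equality) whose equality and hash depend only on id2; the class and field names are assumptions for illustration:

// Sketch: a key class whose equality is based only on id2, so
// reduceByKey treats two instances with the same id2 as the same key.
class MyKey(val id1: Long, val id2: Long) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: MyKey => this.id2 == that.id2
    case _           => false
  }
  // Must agree with equals: equal keys need equal hashes, because
  // Spark routes records to shuffle partitions by the key's hash
  // before it ever calls equals.
  override def hashCode: Int = id2.hashCode
}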


Can't you just map the RDD so that the first element of the pair is the key you want to use?

case class MyClass(id1: Int, id2: Int)

val rddToReduce: RDD[(MyClass, String)] = ... // An RDD with MyClass as key

rddToReduce
  .map { case (MyClass(id1, id2), value) =>
    (id2, (id1, value)) // now the key is id2
  }
  .reduceByKey { case ((id1, value1), (_, value2)) =>
    (id1, value1 + value2) // do the combination here (shown as String concatenation)
  }
  .map { case (id2, (id1, combinedValue)) =>
    (MyClass(id1, id2), combinedValue) // rearrange so that MyClass is the key again
  }
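
A quick usage sketch of that pipeline with made-up sample records, assuming a SparkContext named sc is in scope:

val sample: RDD[(MyClass, String)] = sc.parallelize(Seq(
  (MyClass(1, 10), "a"),
  (MyClass(2, 10), "b"),
  (MyClass(3, 20), "c")
))

val combined = sample
  .map { case (MyClass(id1, id2), value) => (id2, (id1, value)) }
  .reduceByKey { case ((id1, v1), (_, v2)) => (id1, v1 + v2) }
  .map { case (id2, (id1, v)) => (MyClass(id1, id2), v) }

combined.collect() // e.g. Array((MyClass(1,10),"ab"), (MyClass(3,20),"c"))
// Note: which id1 survives for id2 = 10 depends on reduction order
// and is not deterministic.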