Spark: Custom key compare method for reduceByKey
You can't override the comparison used by reduceByKey, because it relies on the key itself to hash-partition your data across the executors in your cluster. You can, however, change the key (and be aware that, depending on the transformations/actions you use, this is likely to re-shuffle the data around).
There is a nifty method on RDD to do this called keyBy, so you can do something like this:
val data: RDD[MyClass] = ... // Same code you have now
val byId2 = data.keyBy(_.id2) // Assuming your ids are Longs, produces an RDD[(Long, MyClass)]
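As a plain-Scala sketch of what keyBy does (no Spark needed to run it; the MyClass fields here are made up for illustration), keyBy(f) simply pairs each element x with its derived key f(x):

```scala
// Hypothetical element type; only id2 matters for this example
case class MyClass(id1: Long, id2: Long, payload: String)

val data = Seq(MyClass(1L, 10L, "a"), MyClass(2L, 20L, "b"))

// Equivalent of data.keyBy(_.id2): each element becomes (key, element)
val byId2 = data.map(x => (x.id2, x))

println(byId2)
```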
If you are able to alter your class, then note that reduceByKey uses equals and hashCode. So, you can make sure those are defined consistently, and that will result in the correct comparisons being used.
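To illustrate why this works (plain Scala, no Spark required; the class and fields below are made up), hash-based grouping of the kind reduceByKey performs relies on exactly equals and hashCode, so defining both over id2 makes two instances with the same id2 collapse to one key:

```scala
// Hypothetical key class that compares only by id2
class Id2Key(val id1: Int, val id2: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Id2Key => this.id2 == that.id2
    case _            => false
  }
  override def hashCode: Int = id2.hashCode
}

val pairs = Seq(
  (new Id2Key(1, 10), 2),
  (new Id2Key(2, 10), 3), // same id2 => same key under equals/hashCode
  (new Id2Key(3, 20), 5)
)

// groupBy uses equals/hashCode, just like Spark's hash partitioning
val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k.id2, vs.map(_._2).sum) }

println(reduced.toList.sortBy(_._1))
```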
Can't you just map the RDD so that the first element of the pair is the key you want to use?
case class MyClass(id1: Int, id2: Int)

val rddToReduce: RDD[(MyClass, String)] = ... // An RDD with MyClass as key

rddToReduce
  .map { case (MyClass(id1, id2), value) =>
    (id2, (id1, value)) // now the key is id2
  }
  .reduceByKey { case ((id1, value1), (_, value2)) =>
    // reduceByKey combines two values at a time, so do the combination here,
    // e.g. (id1, value1 + value2)
    (id1, value1 + value2)
  }
  .map { case (id2, (id1, combinedValue)) =>
    (MyClass(id1, id2), combinedValue) // rearrange so that MyClass is the key again
  }
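To see the key round-trip end to end without a Spark cluster, here is a plain-Scala sketch that simulates reduceByKey with groupBy plus reduce (the sample values and the string-concatenation combiner are made up for illustration):

```scala
case class MyClass(id1: Int, id2: Int)

val input = Seq(
  (MyClass(1, 10), "a"),
  (MyClass(2, 10), "b"), // same id2 as above => should be combined
  (MyClass(3, 20), "c")
)

// Step 1: re-key by id2, tucking id1 into the value
val reKeyed = input.map { case (MyClass(id1, id2), value) => (id2, (id1, value)) }

// Step 2: simulate reduceByKey, which combines two values at a time per key
val reduced = reKeyed
  .groupBy(_._1)
  .map { case (id2, pairs) =>
    (id2, pairs.map(_._2).reduce { case ((id1, v1), (_, v2)) => (id1, v1 + v2) })
  }

// Step 3: restore MyClass as the key
val result = reduced.map { case (id2, (id1, combined)) => (MyClass(id1, id2), combined) }

println(result.toList.sortBy(_._1.id2))
```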