Can I index a column in a Parquet file to make joins faster in Spark?
Spark joins use an object called a partitioner. If a DataFrame has no partitioner, executing a join involves the following steps (sketched in the example below):
1. Create a new hash partitioner for the bigger side
2. Shuffle both DataFrames against this partitioner
3. Rows with the same key now live on the same node, so the join can be completed locally
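To make the default behavior concrete, here is a minimal sketch (the file paths and DataFrame names are placeholders of mine, assuming both Parquet files have an id column). With no partitioner on either side, Spark shuffles both DataFrames before joining:

```scala
// Hypothetical inputs; both files are assumed to contain an "id" column
val small = sqlContext.parquetFile("file:///c:/temp/small.parquet")
val large = sqlContext.parquetFile("file:///c:/temp/large.parquet")

// Neither side has a partitioner, so Spark hash-partitions both
// DataFrames on the join key and shuffles them across the cluster
val joined = large.join(small, large("id") === small("id"))

// The physical plan shows an Exchange (shuffle) on both sides of the join
joined.explain()
```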
You can optimize your join by doing steps 1 and 2 up front. I'd suggest that you repartition your bigger dataset by the join key (id):
```scala
import sqlContext.implicits._ // for the $"id" column syntax

// First DF, which contains a few thousand items
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")

// Second DF, which contains 10 million items
// (docVectorsParquet is the path to the file, defined elsewhere)
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet)
  .repartition($"id") // DataFrame of (id, vector)
```
Now, joining any smaller DataFrame with dfDocVectors is going to be much faster -- the expensive shuffle step for the big DataFrame has already been done.
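As a sketch of how you'd use it (the cache() call is my addition, not part of the original setup; without it the repartition can be recomputed on every action):

```scala
// Keep the repartitioned DF in memory so the shuffle is paid only once,
// even across several joins (caching here is my assumption)
dfDocVectors.cache()

// Only the small side needs to move across the network; the big side
// already sits in partitions keyed by "id"
val joined = dfDocVectors.join(dfExamples, dfDocVectors("id") === dfExamples("id"))
joined.show()
```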