
Can I index a column in a Parquet file to make it join faster using Spark?


Spark joins use an object called a partitioner. If a DataFrame has no partitioner, executing a join will involve these steps:

  1. Create a new hash partitioner for the bigger side
  2. Shuffle both dataframes against this partitioner
  3. With matching keys now co-located on the same nodes, local join operations can complete the join
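
To make the shuffle visible, here is a minimal sketch of a plain join before any repartitioning; the physical plan shows an Exchange (shuffle) under both join inputs. The file paths, row counts, and the "id" column are placeholders, and this uses the Spark 2+ SparkSession API rather than the older sqlContext API in the answer below:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-shuffle-demo").getOrCreate()

    // Hypothetical inputs: a small table and a large table sharing an "id" column
    val small = spark.read.parquet("file:///c:/temp/docVectors.parquet")
    val big   = spark.read.parquet("file:///c:/temp/bigDocVectors.parquet")

    // Neither side has a partitioner, so Spark hash-partitions both inputs
    // on the join key and shuffles them before the join can run locally.
    val joined = small.join(big, Seq("id"))

    // The plan contains an Exchange on both sides (assuming a shuffle-based
    // sort-merge join rather than a broadcast join of the small side).
    joined.explain()
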

You can optimize your join by doing the expensive part of steps 1 and 2 for the larger side ahead of time. I'd suggest repartitioning your bigger dataset by the join key (id):

// First DF, which contains a few thousand items
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")

// Second DF, which contains 10 million items
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet)
  .repartition($"id")  // DataFrame of (id, vector)

Now, joining any smaller DataFrame with dfDocVectors is going to be much faster, since the expensive shuffle step for the big DataFrame has already been done.
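
As a sketch of how this is used (not part of the original answer): if you run several joins against the big DataFrame, persisting the repartitioned result keeps the shuffled layout in memory so it isn't recomputed, and only the small side needs to be moved to match its partitioning. Column and variable names follow the snippet above; the broadcast-join caveat still applies.

    // Keep the already-partitioned big DataFrame around for reuse
    val dfDocVectorsPart = dfDocVectors.persist()

    // Join the small DataFrame against it: only dfExamples is shuffled
    // to line up with dfDocVectorsPart's existing hash partitioning on id.
    val joined = dfExamples.join(dfDocVectorsPart, Seq("id"))

    joined.explain()  // the big side should show no fresh Exchange
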