
How to create a MapFile with Spark and access it?


Digging into the issue, I found that Spark's HashPartitioner and Hadoop's HashPartitioner use different logic, so the same key can be assigned to different partitions depending on which one is used.
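
For illustration, here is a minimal Java sketch comparing the two partitioners on the same key (the key text and partition count are arbitrary example values). They only disagree for keys whose hashCode() is negative: Hadoop masks the sign bit, while Spark takes a non-negative modulo.

    import org.apache.hadoop.io.Text;

    public class PartitionerCompare {
        public static void main(String[] args) {
            int numPartitions = 10;           // example value
            Text key = new Text("some-key");  // example key

            // Hadoop: (key.hashCode() & Integer.MAX_VALUE) % numPartitions
            org.apache.hadoop.mapreduce.lib.partition.HashPartitioner<Text, Text> hadoopPart =
                new org.apache.hadoop.mapreduce.lib.partition.HashPartitioner<>();
            int hadoopIdx = hadoopPart.getPartition(key, null, numPartitions);

            // Spark: non-negative modulo of key.hashCode();
            // differs from Hadoop whenever hashCode() is negative.
            org.apache.spark.HashPartitioner sparkPart =
                new org.apache.spark.HashPartitioner(numPartitions);
            int sparkIdx = sparkPart.getPartition(key);

            System.out.println("Hadoop: " + hadoopIdx + ", Spark: " + sparkIdx);
        }
    }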

So the "brute force" solution I tried and works is the following.

Save the MapFile using rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(num_of_partitions)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class);
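
Spelled out a bit more, the save step looks roughly like this; it's only a sketch, assuming a JavaPairRDD<Text, Text>, a placeholder output path, and a partition count you would tune yourself. The sort within each partition matters because MapFile requires its keys in order.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    public class MapFileWriter {
        // Sort within each partition because MapFile expects its keys in order.
        public static void save(JavaPairRDD<Text, Text> rdd, int numPartitions) {
            rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
               .saveAsNewAPIHadoopFile(
                   "/output/mapfile",          // placeholder output directory
                   Text.class,                 // key class
                   Text.class,                 // value class
                   MapFileOutputFormat.class); // writes one MapFile per partition
        }
    }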

Lookup using the following (a fuller, runnable sketch appears after the list):

  • Reader[] readers = MapFileOutputFormat.getReaders(new Path(file),new Configuration());
  • org.apache.spark.HashPartitioner p = new org.apache.spark.HashPartitioner(readers.length);
  • readers[p.getPartition(key)].get(key,val);
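
Here is the same lookup as a self-contained sketch, with the output path and key value as placeholders. The essential point is that the reader index comes from the Spark HashPartitioner, not the Hadoop one.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
    import org.apache.spark.HashPartitioner;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // One reader per MapFile part written by the Spark job.
            MapFile.Reader[] readers =
                MapFileOutputFormat.getReaders(new Path("/output/mapfile"), conf);

            // Must be the *Spark* partitioner, because that is what decided
            // which partition each key was written to.
            HashPartitioner partitioner = new HashPartitioner(readers.length);

            Text key = new Text("some-key"); // placeholder key
            Text value = new Text();
            MapFile.Reader reader = readers[partitioner.getPartition(key)];
            if (reader.get(key, value) != null) {
                System.out.println(key + " -> " + value);
            } else {
                System.out.println(key + " not found");
            }
        }
    }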

This is "dirty" as the MapFile access is now bound to the Spark partitioner rather than the intuitive Hadoop HashPartitioner. I could implement a Spark partitioner that uses Hadoop HashPartitioner to improve on though.

This also does not address the slow access caused by the relatively large number of reducers, since getReaders opens a reader for every part file. I could make it even 'dirtier' by deriving the part-file name from the partition number and opening only that one reader (a rough sketch is at the end of the post), but I am looking for a clean solution, so please post if there is a better approach to this problem.
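
For completeness, the 'dirtier' single-reader variant I have in mind would look roughly like this; the part-r-%05d naming convention and the path are assumptions about what the job writes out.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.HashPartitioner;

    public class SingleReaderLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int numPartitions = 64; // must match the number used when writing

            Text key = new Text("some-key"); // placeholder key
            Text value = new Text();

            HashPartitioner partitioner = new HashPartitioner(numPartitions);
            // Derive the part name from the partition number and open only that MapFile.
            String part = String.format("part-r-%05d", partitioner.getPartition(key));

            MapFile.Reader reader =
                new MapFile.Reader(new Path("/output/mapfile", part), conf);
            System.out.println(reader.get(key, value) != null ? value : "not found");
            reader.close();
        }
    }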