How to create a MapFile with Spark and access it?
Digging into the issue, I found that Spark's HashPartitioner and Hadoop's HashPartitioner use different logic, so the same key can end up with different partition numbers.
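To make the difference concrete, here is a sketch of the two formulas (as implemented in Spark 1.x's HashPartitioner and Hadoop 2.x's HashPartitioner; the hash value below is a made-up example):

```java
int hash = -7;   // pretend key.hashCode() is negative
int n = 3;       // number of partitions

// Spark's HashPartitioner (a non-negative modulus of hashCode()):
int sparkPartition = ((hash % n) + n) % n;             // -> 2

// Hadoop's HashPartitioner (masks the sign bit, then modulus):
int hadoopPartition = (hash & Integer.MAX_VALUE) % n;  // -> 1
```

For keys with a non-negative hash code the two agree; for negative hash codes they can diverge, as above.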
So the "brute force" solution I tried and works is the following.
Save the MapFile using rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(num_of_partitions)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class);
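For reference, here is a fuller, compilable sketch of the save step (Java API; the output path, key/value types, sample data, and partition count are my own examples, not from the original code). Shuffling on plain String keys sidesteps the fact that Hadoop Writables such as Text are not java.io.Serializable; the conversion to Text happens only after the shuffle:

```java
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SaveMapFile {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("save-mapfile"));

        // Example data; in practice this is the RDD you want to index.
        JavaPairRDD<String, String> rdd = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", "1"), new Tuple2<>("b", "2")));

        int numPartitions = 64; // remember this value for the lookup side

        rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
           // Convert to Writables only after the shuffle; mapToPair keeps the
           // per-partition order that MapFileOutputFormat requires.
           // NB: String and Text sort identically for plain-ASCII keys;
           // double-check the ordering for non-ASCII data.
           .mapToPair(kv -> new Tuple2<Text, Text>(
               new Text(kv._1()), new Text(kv._2())))
           .saveAsNewAPIHadoopFile(
               "/path/to/mapfile",       // example output directory
               Text.class,               // key class
               Text.class,               // value class
               MapFileOutputFormat.class);

        sc.stop();
    }
}
```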
Lookup using:
- Reader[] readers = MapFileOutputFormat.getReaders(new Path(file),new Configuration());
- org.apache.spark.HashPartitioner p = new org.apache.spark.HashPartitioner(readers.length);
- readers[p.getPartition(key)].get(key,val);
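Put together, a compilable sketch of the lookup side looks like this (the path and key are again my own examples). One subtlety worth spelling out: the partition must be computed from the same Java type that was hashed at write time, String here, because Text.hashCode() and String.hashCode() differ:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.spark.HashPartitioner;

public class LookupMapFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Opens one MapFile.Reader per part-r-* directory under the output path.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(new Path("/path/to/mapfile"), conf);

        // Must mirror the partitioner (and partition count) used when saving.
        HashPartitioner partitioner = new HashPartitioner(readers.length);

        String key = "a";   // example key
        Text value = new Text();

        // Hash the String (as at write time); hand a Text to the MapFile itself.
        MapFile.Reader reader = readers[partitioner.getPartition(key)];
        if (reader.get(new Text(key), value) != null) {
            System.out.println(key + " -> " + value);
        }

        for (MapFile.Reader r : readers) {
            r.close();
        }
    }
}
```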
This is "dirty" as the MapFile access is now bound to the Spark partitioner rather than the intuitive Hadoop HashPartitioner. I could implement a Spark partitioner that uses Hadoop HashPartitioner
to improve on though.
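A minimal sketch of such a partitioner (the class name is mine): it plugs into repartitionAndSortWithinPartitions above but reproduces the formula from Hadoop's org.apache.hadoop.mapreduce.lib.partition.HashPartitioner. Note that the same key representation must then be hashed on both the write and the read side:

```java
import org.apache.spark.Partitioner;

// Spark Partitioner that mimics Hadoop's HashPartitioner, so a MapFile
// written through it can be read back with Hadoop's partitioning logic.
public class HadoopHashPartitioner extends Partitioner {
    private final int partitions;

    public HadoopHashPartitioner(int partitions) {
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return partitions;
    }

    @Override
    public int getPartition(Object key) {
        // Identical formula to Hadoop's HashPartitioner.getPartition().
        return (key.hashCode() & Integer.MAX_VALUE) % partitions;
    }

    // equals/hashCode let Spark recognize two instances as the same
    // partitioning and skip unnecessary shuffles.
    @Override
    public boolean equals(Object other) {
        return other instanceof HadoopHashPartitioner
            && ((HadoopHashPartitioner) other).partitions == partitions;
    }

    @Override
    public int hashCode() {
        return partitions;
    }
}
```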
This also does not address the problem of slow access when there is a relatively large number of reducers, since getReaders opens a Reader for every partition up front. I could make this even 'dirtier' by deriving the part file number from the partitioner and opening only that file, but I am looking for a clean solution, so please post if there is a better approach to this problem.
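For completeness, here is roughly what that 'dirtier' variant could look like. This is a sketch only: it assumes Hadoop 2.x's MapFile.Reader(Path, Configuration) constructor and the standard part-r-%05d directory names that saveAsNewAPIHadoopFile produces, so verify the naming in your output directory. It opens, and caches, only the single MapFile a key can live in instead of one Reader per partition:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.spark.HashPartitioner;

public class LazyMapFileLookup {
    private final Configuration conf = new Configuration();
    private final Path dir;
    private final HashPartitioner partitioner;
    private final Map<Integer, MapFile.Reader> cache = new HashMap<>();

    // numPartitions must match the value used when the MapFile was saved.
    public LazyMapFileLookup(String dir, int numPartitions) {
        this.dir = new Path(dir);
        this.partitioner = new HashPartitioner(numPartitions);
    }

    public Text get(String key) throws IOException {
        int part = partitioner.getPartition(key);
        // Open only the part file this key can be in, and keep it open.
        MapFile.Reader reader = cache.computeIfAbsent(part, p -> {
            try {
                return new MapFile.Reader(
                    new Path(dir, String.format("part-r-%05d", p)), conf);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        Text value = new Text();
        return (Text) reader.get(new Text(key), value); // null if absent
    }
}
```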