How to create a MapFile with Spark and access it?
Digging into the issue, I found that Spark's HashPartitioner and Hadoop's HashPartitioner use different logic, so the same key can end up with different partition numbers.
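To make the difference concrete, here is a sketch of the two formulas (as implemented in Spark 1.x's HashPartitioner and Hadoop 2.x's HashPartitioner; the hash value below is a made-up example):

```java
int hash = -7;   // pretend key.hashCode() is negative
int n = 3;       // number of partitions

// Spark's HashPartitioner (a non-negative modulus of hashCode()):
int sparkPartition = ((hash % n) + n) % n;             // -> 2

// Hadoop's HashPartitioner (masks the sign bit, then modulus):
int hadoopPartition = (hash & Integer.MAX_VALUE) % n;  // -> 1
```

For keys with a non-negative hash code the two agree; for negative hash codes they can diverge, as above.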
So the "brute force" solution I tried and works is the following.
Save the MapFile using rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(num_of_partitions)).saveAsNewAPIHadoopFile(....MapFileOutputFormat.class);
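For reference, here is a fuller, compilable sketch of the save step (Java API; the output path, key/value types, sample data, and partition count are my own examples, not from the original code). Shuffling on plain String keys sidesteps the fact that Hadoop Writables such as Text are not java.io.Serializable; the conversion to Text happens only after the shuffle:

```java
import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SaveMapFile {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("save-mapfile"));

        // Example data; in practice this is the RDD you want to index.
        JavaPairRDD<String, String> rdd = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", "1"), new Tuple2<>("b", "2")));

        int numPartitions = 64; // remember this value for the lookup side

        rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
           // Convert to Writables only after the shuffle; mapToPair keeps the
           // per-partition order that MapFileOutputFormat requires.
           // NB: String and Text sort identically for plain-ASCII keys;
           // double-check the ordering for non-ASCII data.
           .mapToPair(kv -> new Tuple2<Text, Text>(
               new Text(kv._1()), new Text(kv._2())))
           .saveAsNewAPIHadoopFile(
               "/path/to/mapfile",       // example output directory
               Text.class,               // key class
               Text.class,               // value class
               MapFileOutputFormat.class);

        sc.stop();
    }
}
```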
Lookup using:
- Reader[] readers = MapFileOutputFormat.getReaders(new Path(file),new Configuration());
- org.apache.spark.HashPartitioner p = new org.apache.spark.HashPartitioner(readers.length);
- readers[p.getPartition(key)].get(key,val);
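Put together, a compilable sketch of the lookup side looks like this (the path and key are again my own examples). One subtlety worth spelling out: the partition must be computed from the same Java type that was hashed at write time, String here, because Text.hashCode() and String.hashCode() differ:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.spark.HashPartitioner;

public class LookupMapFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // Opens one MapFile.Reader per part-r-* directory under the output path.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(new Path("/path/to/mapfile"), conf);

        // Must mirror the partitioner (and partition count) used when saving.
        HashPartitioner partitioner = new HashPartitioner(readers.length);

        String key = "a";   // example key
        Text value = new Text();

        // Hash the String (as at write time); hand a Text to the MapFile itself.
        MapFile.Reader reader = readers[partitioner.getPartition(key)];
        if (reader.get(new Text(key), value) != null) {
            System.out.println(key + " -> " + value);
        }

        for (MapFile.Reader r : readers) {
            r.close();
        }
    }
}
```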
This is "dirty" as the MapFile access is now bound to the Spark partitioner rather than the intuitive Hadoop HashPartitioner. I could implement a Spark partitioner that uses Hadoop HashPartitioner
to improve on though.
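A minimal sketch of such a partitioner (the class name is mine): it plugs into repartitionAndSortWithinPartitions above but reproduces the formula from Hadoop's org.apache.hadoop.mapreduce.lib.partition.HashPartitioner. Note that the same key representation must then be hashed on both the write and the read side:

```java
import org.apache.spark.Partitioner;

// Spark Partitioner that mimics Hadoop's HashPartitioner, so a MapFile
// written through it can be read back with Hadoop's partitioning logic.
public class HadoopHashPartitioner extends Partitioner {
    private final int partitions;

    public HadoopHashPartitioner(int partitions) {
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return partitions;
    }

    @Override
    public int getPartition(Object key) {
        // Identical formula to Hadoop's HashPartitioner.getPartition().
        return (key.hashCode() & Integer.MAX_VALUE) % partitions;
    }

    // equals/hashCode let Spark recognize two instances as the same
    // partitioning and skip unnecessary shuffles.
    @Override
    public boolean equals(Object other) {
        return other instanceof HadoopHashPartitioner
            && ((HadoopHashPartitioner) other).partitions == partitions;
    }

    @Override
    public int hashCode() {
        return partitions;
    }
}
```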
This also does not address the problem of slow access when there is a relatively large number of reducers, since getReaders opens a Reader for every partition up front. I could make this even 'dirtier' by deriving the part file number from the partitioner and opening only that file, but I am looking for a clean solution, so please post if there is a better approach to this problem.
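For completeness, here is roughly what that 'dirtier' variant could look like. This is a sketch only: it assumes Hadoop 2.x's MapFile.Reader(Path, Configuration) constructor and the standard part-r-%05d directory names that saveAsNewAPIHadoopFile produces, so verify the naming in your output directory. It opens, and caches, only the single MapFile a key can live in instead of one Reader per partition:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.spark.HashPartitioner;

public class LazyMapFileLookup {
    private final Configuration conf = new Configuration();
    private final Path dir;
    private final HashPartitioner partitioner;
    private final Map<Integer, MapFile.Reader> cache = new HashMap<>();

    // numPartitions must match the value used when the MapFile was saved.
    public LazyMapFileLookup(String dir, int numPartitions) {
        this.dir = new Path(dir);
        this.partitioner = new HashPartitioner(numPartitions);
    }

    public Text get(String key) throws IOException {
        int part = partitioner.getPartition(key);
        // Open only the part file this key can be in, and keep it open.
        MapFile.Reader reader = cache.computeIfAbsent(part, p -> {
            try {
                return new MapFile.Reader(
                    new Path(dir, String.format("part-r-%05d", p)), conf);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        Text value = new Text();
        return (Text) reader.get(new Text(key), value); // null if absent
    }
}
```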