
Hadoop DistributedCache is deprecated - what is the preferred API?


The Distributed Cache APIs now live in the Job class itself. See the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html

The code should look something like this:

    // Job.getInstance() replaces the deprecated new Job() constructor.
    Job job = Job.getInstance();
    ...
    job.addCacheFile(new Path(filename).toUri());

In your mapper code:

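    // Note: getLocalCacheFiles() is itself marked deprecated in Hadoop 2;
    // context.getCacheFiles() returns the cache file URIs instead.
    Path[] localPaths = context.getLocalCacheFiles();
    ...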


To expand on @jtravaglini, the preferred way of using DistributedCache for YARN/MapReduce 2 is as follows:

In your driver, use Job.addCacheFile():

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "MyJob");
        job.setMapperClass(MyMapper.class);
        // ...
        // Mind the # sign after the absolute file location.
        // You will be using the name after the # sign as your
        // file name in your Mapper/Reducer
        job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
        job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
        return job.waitForCompletion(true) ? 0 : 1;
    }
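To make the # convention concrete, here is a minimal sketch using only java.net.URI (no Hadoop classes) that shows which part of a cache URI ends up as the local file name. The class CacheUriDemo and the method linkName are hypothetical names for illustration. The fallback to the file's own name when no # fragment is given reflects my understanding of how MapReduce 2 behaves, so treat it as an assumption and verify it against your Hadoop version.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Hadoop: illustrates how a cache URI
// splits into the file location and the name after the '#' sign.
final class CacheUriDemo {

    /** Returns the name after '#', or the last path segment if there is none. */
    static String linkName(String cacheUri) throws URISyntaxException {
        URI uri = new URI(cacheUri);
        String fragment = uri.getFragment();
        if (fragment != null) {
            return fragment;
        }
        // Assumed fallback: the file's own name is used as the link name.
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints "some"
        System.out.println(linkName("/user/yourname/cache/some_file.json#some"));
        // prints "other_file.json"
        System.out.println(linkName("/user/yourname/cache/other_file.json"));
    }
}
```

Picking short, distinct names after the # keeps the Mapper/Reducer code independent of the file's actual location.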

And in your Mapper/Reducer, override the setup(Context context) method:

    @Override
    protected void setup(
            Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (context.getCacheFiles() != null
                && context.getCacheFiles().length > 0) {
            File some_file = new File("./some");
            File other_file = new File("./other");
            // Do things to these two files, like read them
            // or parse as JSON or whatever.
        }
        super.setup(context);
    }
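As a rough sketch of the "read them" step, the helper below loads a cached file's full contents as a UTF-8 string, which you could then hand to whatever JSON parser you use. It relies only on the JDK. CachedFileReader and readCachedFile are hypothetical names, and the temporary directory in main merely stands in for the task's working directory, where the file would appear under the name given after the # sign.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of Hadoop: reads a file by its link name
// relative to a working directory.
final class CachedFileReader {

    /** Reads the whole file named linkName inside workingDir as UTF-8 text. */
    static String readCachedFile(File workingDir, String linkName) throws IOException {
        File file = new File(workingDir, linkName);
        if (!file.isFile()) {
            // Fail with a clear message if the file is not where we expect it.
            throw new FileNotFoundException("Cache file not found: " + file);
        }
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the task's working directory, with a sample file in it.
        File dir = Files.createTempDirectory("cache-demo").toFile();
        Files.write(new File(dir, "some").toPath(),
                "{\"key\": \"value\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readCachedFile(dir, "some"));
    }
}
```

Inside setup(), the equivalent call would be along the lines of readCachedFile(new File("."), "some"). Reading the file once there and keeping the parsed result in a field avoids re-reading it for every record in map().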


The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class.

   Job.addCacheFile()

Unfortunately, comprehensive tutorial-style examples of this are still scarce.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29