Hadoop DistributedCache is deprecated - what is the preferred API?
The APIs for the distributed cache can be found in the Job class itself; see the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html The code should look something like this:
```java
// Note: the Job() constructors are themselves deprecated;
// use the Job.getInstance() factory methods instead.
Job job = Job.getInstance(conf);
// ...
job.addCacheFile(new Path(filename).toUri());
```
In your mapper code:

```java
// getLocalCacheFiles() is itself deprecated in MR2; prefer
// context.getCacheFiles(), which returns URI[] instead of Path[].
Path[] localPaths = context.getLocalCacheFiles();
// ...
```
To expand on @jtravaglini, the preferred way of using DistributedCache for YARN/MapReduce 2 is as follows.

In your driver, use Job.addCacheFile():
```java
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "MyJob");
    job.setMapperClass(MyMapper.class);
    // ...

    // Mind the # sign after the absolute file location.
    // You will be using the name after the # sign as your
    // file name in your Mapper/Reducer.
    job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
    job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));

    return job.waitForCompletion(true) ? 0 : 1;
}
```
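The driver above relies on the URI fragment (the part after `#`) to name the symlink that will appear in the task's working directory. That split can be checked with plain `java.net.URI`, no Hadoop required; the paths below are the hypothetical ones from the snippet:

```java
import java.net.URI;

public class CacheUriDemo {
    public static void main(String[] args) throws Exception {
        // The part before '#' is the path of the cached file; the
        // fragment after '#' becomes the symlink name in the task's
        // local working directory.
        URI cached = new URI("/user/yourname/cache/some_file.json#some");
        System.out.println(cached.getPath());     // /user/yourname/cache/some_file.json
        System.out.println(cached.getFragment()); // some
    }
}
```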
And in your Mapper/Reducer, override the setup(Context context) method:
```java
@Override
protected void setup(
        Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    if (context.getCacheFiles() != null
            && context.getCacheFiles().length > 0) {
        // The cached files are symlinked into the task's working
        // directory under the names given after the # sign.
        File some_file = new File("./some");
        File other_file = new File("./other");
        // Do things to these two files, like read them
        // or parse as JSON or whatever.
    }
    super.setup(context);
}
```
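Once the framework has symlinked the cached files into the working directory, reading them in setup() is ordinary local file I/O. A minimal sketch with no Hadoop dependency (it writes a stand-in file first, since outside a running task there is no "./some" symlink; the JSON content is made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ReadCachedFileDemo {
    // In a real Mapper.setup(), linkName would be "./some" or
    // "./other" -- the names chosen after the '#' sign in the driver.
    static List<String> readCached(String linkName) throws IOException {
        return Files.readAllLines(Paths.get(linkName));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the symlink the framework would create:
        Path demo = Files.createTempFile("some_file", ".json");
        Files.write(demo, Arrays.asList("{\"key\": \"value\"}"));
        System.out.println(readCached(demo.toString()).get(0));
    }
}
```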
The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class; the method to look for is Job.addCacheFile(). Unfortunately, there aren't yet many comprehensive tutorial-style examples of it.