Hadoop MapReduce intermediate output

logging hadoop mapreduce

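The keep.task.files.pattern parameter (named mapreduce.task.files.preserve.filepattern in newer Hadoop releases) can be used to keep the intermediate files of tasks whose names match the given pattern. The preserved files are not deleted automatically, so they have to be cleaned up manually once the job has completed. Since this is a map/reduce task property, it has to be set in the configuration file and the jar file packaged again.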


I don't think the MR framework provides any configuration to save intermediate map output files. Even if such a flag exists, it is not very useful because:

The intermediate output produced by the maps can't easily be read or used, because:
1) The key/value output is serialized before it is written to the intermediate files.
2) Metadata about each key/value pair (key length, value length, partition number) is also written to these files, in binary format.
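To illustrate why such files are not human-readable, here is a minimal sketch of a length-prefixed binary record. The layout (partition, key length, value length, key bytes, value bytes, all as fixed-width ints) and the class name IntermediateRecordSketch are assumptions made for illustration; Hadoop's actual intermediate file format differs in its details, so this is not a reader for real map output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified record layout (NOT Hadoop's real on-disk format):
//   [int partition][int keyLength][int valueLength][key bytes][value bytes]
class IntermediateRecordSketch {

    // Serialize one key/value pair together with its binary metadata.
    static void writeRecord(DataOutputStream out, int partition, String key, String value)
            throws IOException {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(partition);
        out.writeInt(k.length);
        out.writeInt(v.length);
        out.write(k);
        out.write(v);
    }

    // Decode all records; this only works because the reader knows the layout.
    static List<String> readRecords(byte[] data) throws IOException {
        List<String> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int partition = in.readInt();
                byte[] k = new byte[in.readInt()]; // key length comes first
                byte[] v = new byte[in.readInt()]; // then value length
                in.readFully(k);
                in.readFully(v);
                records.add(partition + ":" + new String(k, StandardCharsets.UTF_8)
                        + "=" + new String(v, StandardCharsets.UTF_8));
            }
        }
        return records;
    }

    // Build a small in-memory "file" holding two sample records.
    static byte[] sample() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeRecord(out, 0, "hadoop", "1");
            writeRecord(out, 1, "mapreduce", "2");
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sample();
        // Printed as text, the raw bytes are mostly unreadable control characters.
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
        // Decoding with knowledge of the layout recovers the pairs.
        System.out.println(readRecords(data));
    }
}
```

Opening such a file in a text editor shows the keys and values interleaved with non-printable length and partition bytes, which is the practical obstacle described above.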

Example locations of these intermediate files:
a) Intermediate spill file (spill output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/attempt_1525687099554_0008_m_000000_0_spill_0.out
b) Final intermediate file (merge output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/output/attempt_1525687099554_0008_m_000001_0/file.out
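If you need to tell the two kinds of file apart when inspecting a node's local directories, a small helper along these lines may be enough. The regular expressions are inferred only from the two example paths above, and the class name IntermediatePathSketch is hypothetical; other Hadoop versions or configurations may lay the files out differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Classifies a path as spill output or merged map output.
// Patterns are assumptions derived from the two example paths only.
class IntermediatePathSketch {

    private static final Pattern SPILL =
            Pattern.compile("(attempt_\\d+_\\d+_m_\\d+_\\d+)_spill_(\\d+)\\.out$");
    private static final Pattern MERGED =
            Pattern.compile("output/(attempt_\\d+_\\d+_m_\\d+_\\d+)/file\\.out$");

    static String describe(String path) {
        Matcher spill = SPILL.matcher(path);
        if (spill.find()) {
            return "spill " + spill.group(2) + " of " + spill.group(1);
        }
        Matcher merged = MERGED.matcher(path);
        if (merged.find()) {
            return "merged output of " + merged.group(1);
        }
        return "unrecognized";
    }

    public static void main(String[] args) {
        String base = "/yarn/nm/usercache/root/appcache/application_1525687099554_0008/";
        System.out.println(describe(base + "attempt_1525687099554_0008_m_000000_0_spill_0.out"));
        System.out.println(describe(base + "output/attempt_1525687099554_0008_m_000001_0/file.out"));
    }
}
```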
