Hadoop MapReduce intermediate output


The keep.task.files.pattern parameter (named mapreduce.task.files.preserve.filepattern in newer Hadoop releases) can be used to keep the intermediate files. The intermediate files then have to be cleaned up manually once the job has completed. Since this is a map/reduce task property, it has to be set in the configuration file and the jar file packaged again.


I don't think the MR framework provides any configuration to save intermediate map output files. Even if such a flag exists, it is not very useful, because the intermediate output produced by the maps can't easily be read or used:
1) Key/value output is serialized before it is written to the intermediate files.
2) Metadata about the key/value pairs (key length, value length, partition number) is also written to these files, in binary format.
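
To illustrate why such a file is hard to read by eye, here is a minimal sketch of length-prefixed key/value records. It is a simplification under stated assumptions: the fixed 4-byte big-endian lengths and the helper names write_record and read_records are made up for illustration and are not Hadoop's actual on-disk format, which uses variable-length integers, checksums and optional compression.

```python
import io
import struct

def write_record(out, key: bytes, value: bytes) -> None:
    """Append one record: key length, value length, then the raw bytes.

    Hypothetical layout for illustration only (not Hadoop's real format).
    """
    out.write(struct.pack(">ii", len(key), len(value)))
    out.write(key)
    out.write(value)

def read_records(data: bytes):
    """Yield (key, value) pairs from a buffer produced by write_record."""
    buf = io.BytesIO(data)
    while True:
        header = buf.read(8)
        if len(header) < 8:  # no complete header left: end of data
            break
        key_len, value_len = struct.unpack(">ii", header)
        yield buf.read(key_len), buf.read(value_len)

# Two word-count style records; the counts are serialized as 4-byte integers.
buf = io.BytesIO()
write_record(buf, b"hadoop", struct.pack(">i", 1))
write_record(buf, b"mapreduce", struct.pack(">i", 2))
raw = buf.getvalue()

print(raw)                      # lengths and counts show up as byte escapes
print(list(read_records(raw)))  # readable again only with knowledge of the layout
```

Without knowing the layout and the serialization of keys and values, the raw bytes cannot be split back into records, which is the point made in the two items above.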

Example locations of these intermediate files:
a) Intermediate file (spill output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/attempt_1525687099554_0008_m_000000_0_spill_0.out
b) Final Intermediate file (Merge Output): /yarn/nm/usercache/root/appcache/application_1525687099554_0008/output/attempt_1525687099554_0008_m_000001_0/file.out
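
If such files are kept on a node, a small script can help list them before they are cleaned up by hand. This is a rough sketch that assumes the directory layout and file naming seen in the two example paths above; the function name find_intermediate_files is hypothetical, and the patterns may need adjusting for other Hadoop versions or configurations.

```python
from pathlib import Path

def find_intermediate_files(app_dir):
    """Return (spill_files, merged_files) under one application's appcache directory.

    Assumes spill files are named like "*_spill_*.out" and merged map output
    is named "file.out", as in the example paths in this answer.
    """
    root = Path(app_dir)
    if not root.is_dir():
        return [], []
    spills = sorted(root.rglob("*_spill_*.out"))
    merged = sorted(root.rglob("file.out"))
    return spills, merged

# Example directory taken from the paths above; yields empty lists if it does not exist.
spills, merged = find_intermediate_files(
    "/yarn/nm/usercache/root/appcache/application_1525687099554_0008"
)
for path in spills + merged:
    print(path)
```

The script only lists files; deleting them is left as a deliberate manual step.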