Unusual Hadoop error - tasks get killed on their own Unusual Hadoop error - tasks get killed on their own hadoop hadoop

Unusual Hadoop error - tasks get killed on their own


There are three things to try:

Setting a Counter
If Hadoop sees a counter for the job progressing then it won't kill it (see Arockiaraj Durairaj's answer.) This seems to be the most elegant as it could allow you more insight into long running jobs and were the hangups may be.

Longer Task Timeouts
Hadoop jobs timeout after 10 minutes by default. Changing the timeout is somewhat brute force, but could work. Imagine analyzing audio files that are generally 5MB files (songs), but you have a few 50MB files (entire album). Hadoop stores an individual file per block. So if your HDFS block size is 64MB then a 5MB file and a 50 MB file would both require 1 block (64MB) (see here http://blog.cloudera.com/blog/2009/02/the-small-files-problem/, and here Small files and HDFS blocks.) However, the 5MB job would run faster than the 50MB job. Task timeout can be increased in the code (mapred.task.timeout) for the job per the answers to this similar question: How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

Increase Task Attempts
Configure Hadoop to make more than the 4 default attempts (see Pradeep Gollakota's answer). This is the most brute force method of the three. Hadoop will attempt the job more times, but you could be masking an underlying issue (small servers, large data blocks, etc).


Can you try using counter(hadoop counter) in your reduce logic? It looks like hadoop is not able to determine whether your reduce program is running or hanging. It waits for a few minutes and kills it, even though your logic may be still executing.