Hadoop streaming job failure: Task process exit with nonzero status of 137
Exit code 137 means the process was terminated by SIGKILL (128 + 9), which is a typical sign of the infamous OOM killer. You can easily check this with the dmesg command, looking for messages like these:
[2094250.428153] CPU: 23 PID: 28108 Comm: node Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt20-1+deb8u2
[2094250.428155] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[2094250.428156] ffff880773439400 ffffffff8150dacf ffff881328ea32f0 ffffffff8150b6e7
[2094250.428159] ffff881328ea3808 0000000100000000 ffff88202cb30080 ffff881328ea32f0
[2094250.428162] ffff88107fdf2f00 ffff88202cb30080 ffff88202cb30080 ffff881328ea32f0
[2094250.428164] Call Trace:
[2094250.428174] [<ffffffff8150dacf>] ? dump_stack+0x41/0x51
[2094250.428177] [<ffffffff8150b6e7>] ? dump_header+0x76/0x1e8
[2094250.428183] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428186] [<ffffffff8114088d>] ? oom_kill_process+0x21d/0x370
[2094250.428188] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428193] [<ffffffff811a053a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
[2094250.428195] [<ffffffff8119fac0>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
[2094250.428199] [<ffffffff81141040>] ? pagefault_out_of_memory+0x10/0x80
[2094250.428203] [<ffffffff81057505>] ? __do_page_fault+0x3c5/0x4f0
[2094250.428208] [<ffffffff8109d017>] ? put_prev_entity+0x57/0x350
[2094250.428211] [<ffffffff8109be86>] ? set_next_entity+0x56/0x70
[2094250.428214] [<ffffffff810a2c61>] ? pick_next_task_fair+0x6e1/0x820
[2094250.428219] [<ffffffff810115dc>] ? __switch_to+0x15c/0x570
[2094250.428222] [<ffffffff81515ce8>] ? page_fault+0x28/0x30
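Where the 137 comes from: a process terminated by a signal exits with status 128 plus the signal number, and SIGKILL is signal 9, so 128 + 9 = 137. A quick demonstration:

```shell
# A subshell kills itself with SIGKILL (signal 9);
# the parent shell then observes exit status 128 + 9 = 137.
sh -c 'kill -9 $$'
echo $?    # prints 137
```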
You can easily check the overcommit mode (and thus whether the OOM killer can act):
$ cat /proc/sys/vm/overcommit_memory
0
Basically, the OOM killer tries to kill the process that consumes the largest amount of memory. And you probably don't want to disable it:
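The kernel picks its victim by a per-process badness score, which you can inspect under /proc; a quick sketch on Linux (the 500 value is just an illustrative bias, not a recommendation):

```shell
# Each process has an OOM score; the highest scorer is killed first.
cat /proc/self/oom_score

# oom_score_adj (-1000..1000) biases the choice; raising it needs no
# privileges, lowering it below the inherited value requires root.
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score    # the score is now higher
```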
The OOM killer can be completely disabled with the following command. This is not recommended for production environments, because if an out-of-memory condition does present itself, there could be unexpected behavior depending on the available system resources and configuration. This unexpected behavior could be anything from a kernel panic to a hang depending on the resources available to the kernel at the time of the OOM condition.
sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
The same situation can happen if you use e.g. cgroups for limiting memory. When a process exceeds its given limit, it gets killed without warning.
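As a minimal sketch of that cgroup scenario (cgroup v1 memory controller; the mount paths and the "demo" group name are assumptions, and the commands need root): any process placed in the group that exceeds the limit is SIGKILLed, so it, too, exits with status 137.

```shell
# Create a cgroup with a 256 MB memory limit (cgroup v1 layout).
mkdir /sys/fs/cgroup/memory/demo
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

# Move the current shell (and its future children) into the cgroup;
# any child that exceeds 256 MB will be killed by the kernel.
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs
```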