Hadoop Streaming Never Finishes


I just got this error on a similar (simple) problem. For me, the error was due to the Python script dying during execution. Like in your case, my script seemed to work just fine on a small subset of the data but wouldn't work on Hadoop for the entire dataset, and that turned out to be caused by flawed input. So, even though this might not be why your script is dying, you should probably add some sanity checks, for example:


Check if the length of parts is what you expect it to be.

Check if parts is empty.
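
Something along these lines should work (a minimal sketch; I'm assuming tab-delimited input records and guessing at what firstLetterMapper.py emits, so adjust both to your data):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    parts = line.strip().split("\t")  # assumed tab-delimited records
    if not parts or not parts[0]:     # skip blank lines
        continue
    if len(parts) != 2:               # assumed field count; adjust to your data
        sys.stderr.write("skipping malformed line: %r\n" % line)
        continue
    # emit the first letter as the key and the whole field as the value
    print("%s\t%s" % (parts[0][0], parts[0]))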

Also, you can go to the job tracker and see the exact error that caused Hadoop to stop execution. This will not give you the Python stack trace you might be expecting, but it is still helpful. The job tracker can usually be found at

http://localhost:50030/jobtracker.jsp

Also, change

#!/usr/bin/env

to

#!/usr/bin/python

This is because, with the original shebang, the machine running your script does not know what to do with it. It would probably freeze up your own computer as well if you ran it with ./firstLetterMapper.py instead of python firstLetterMapper.py


The hadoop-streaming-x.y.z.jar should be under your $HADOOP_HOME, which wasn't defined on my machine but should have pointed to /usr/lib/hadoop.
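
If you're not sure where the jar actually ended up, a quick sketch like this will find it (the /usr/lib/hadoop fallback is just where my install happened to live):

import os

# Look under HADOOP_HOME (or the usual default) for the streaming jar.
home = os.environ.get("HADOOP_HOME", "/usr/lib/hadoop")
for root, _, files in os.walk(home):
    for name in files:
        if name.startswith("hadoop-streaming") and name.endswith(".jar"):
            print(os.path.join(root, name))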

I think the Hadoop Streaming documentation addresses your issues pretty well, and it includes a Python example.

First, your mapper nodes need a copy of the Python file you wrote, so list it on the command line with the -file option.
Second, if you're not using a reducer, you don't need to define one; just set the number of reduce tasks to 0.

$ hadoop jar /hadoop/hadoop-streaming-1.2.1.jar \
    -D mapred.reduce.tasks=0 \
    -input /stock -output /company_index \
    -mapper firstLetterMapper.py \
    -file /home/msknapp/workspace/stock/stock.mr/scripts/firstLetterMapper.py

Third, your shebang would just run env on the file; you should change it to #!/usr/bin/python or #!/usr/bin/env python

That's probably what caused env to return a non-zero exit value, and why your mapper, which ran for ~30 seconds, was retried as attempt 2 about 10 minutes later.
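
You can catch this kind of failure before submitting the job by piping a couple of lines through the mapper locally and checking the exit code (a minimal sketch; the sample records and the tab-delimited format are assumptions on my part):

import subprocess

# Assumed input format: tab-delimited stock records; adjust to your data.
sample = b"GOOG\tGoogle Inc.\nAAPL\tApple Inc.\n"

# Run the mapper the same way Hadoop Streaming would: stdin in, stdout out.
proc = subprocess.Popen(["python", "firstLetterMapper.py"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(sample)

print(out)
# A non-zero exit code here is what makes Hadoop fail the attempt and retry it.
print(proc.returncode)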