Fine-tuning Pig for local execution
Pig's documentation makes it clear that local mode is intended to run single-threaded, and that it takes different code paths for certain functions that would otherwise use a distributed sort. As a result, optimizing for Pig's local mode seems like the wrong solution to the presented problem.
Have you considered running a local, "pseudo-distributed" cluster instead of investing in full cluster setup? You can follow Hadoop's instructions for pseudo-distributed operation, then point Pig at `localhost`. This would have the desired result, at the expense of two-step startup and teardown.
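As a rough sketch of what "pointing at localhost" looks like, assuming a Hadoop 1.x-style install: the fragments below use the property names from Hadoop's single-node setup guide. The ports 9000 and 9001 are the ones that guide uses as examples, not requirements. Pig in mapreduce mode should pick these up as long as the Hadoop conf directory is visible to it (for example via `HADOOP_CONF_DIR`); check this against your Pig version.

```xml
<!-- $HADOOP_HOME/conf/core-site.xml: HDFS on the local machine -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- port 9000 is an example value; any free port works -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- $HADOOP_HOME/conf/mapred-site.xml: JobTracker on the local machine -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- port 9001 is an example value -->
    <value>localhost:9001</value>
  </property>
</configuration>
```

These are two separate files shown together for brevity; each file has its own `<configuration>` root element.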
You'll want to raise the number of concurrent map and reduce tasks from their defaults so that they consume all the cores available on your machine. Fortunately, this is reasonably well-documented (admittedly, in the cluster setup documentation); simply define `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` in your local copy of `$HADOOP_HOME/conf/mapred-site.xml`.
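For illustration, the two properties might look like this in `mapred-site.xml`. The value 4 is only a placeholder for a hypothetical machine; pick numbers that suit your own core count and memory. Both settings are read by the TaskTracker, so expect to restart it after changing them.

```xml
<!-- $HADOOP_HOME/conf/mapred-site.xml (add inside the existing <configuration> element) -->
<configuration>
  <property>
    <!-- maximum number of map tasks the local TaskTracker runs at once -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- placeholder; tune to your machine -->
  </property>
  <property>
    <!-- maximum number of reduce tasks the local TaskTracker runs at once -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value> <!-- placeholder; tune to your machine -->
  </property>
</configuration>
```

Because map and reduce tasks can overlap during a job, setting both values to the full core count may oversubscribe the CPU; splitting the cores between the two is a common starting point.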