Fine tuning PIG for local execution


Pig's documentation makes it clear that local mode is intended to run single-threaded, and that it takes different code paths for certain functions that would otherwise use a distributed sort. As a result, optimizing Pig's local mode seems like the wrong solution to the problem presented.

Have you considered running a local, "pseudo-distributed" cluster instead of investing in a full cluster setup? You can follow Hadoop's instructions for pseudo-distributed operation, then point Pig at localhost. This would give you the desired result, at the expense of a two-step startup and teardown.
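For reference, the pseudo-distributed configuration boils down to roughly the following. This is a sketch based on the Hadoop 1.x single-node setup guide; the ports 9000 and 9001 are the ones used in that guide and are otherwise arbitrary, and each configuration element belongs in its own file under $HADOOP_HOME/conf, as the comments indicate. Pig running in mapreduce mode should then pick up the same fs.default.name and mapred.job.tracker values, assuming it can see this conf directory.

```xml
<!-- conf/core-site.xml: run HDFS on localhost (port taken from the Hadoop guide) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker on localhost (port taken from the Hadoop guide) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```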

You'll want to raise the default number of mappers and reducers so that they consume all the cores available on your machine. Fortunately, this is reasonably well documented (admittedly, in the cluster setup documentation): simply define mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in your local copy of $HADOOP_HOME/conf/mapred-site.xml.
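As an illustration, the two properties could look something like this in mapred-site.xml. The value 4 is only an assumption for a hypothetical four-core machine; pick values that suit your own core count and memory. Add the two property elements alongside whatever properties the file already contains, and restart the TaskTracker afterwards so the new limits take effect.

```xml
<!-- $HADOOP_HOME/conf/mapred-site.xml -->
<configuration>
  <!-- Maximum number of map tasks this TaskTracker runs at once.
       4 is an assumed value for a four-core machine. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- Maximum number of reduce tasks this TaskTracker runs at once.
       Again, 4 is an assumed value. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>
```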