Apache Pig: FLATTEN and parallel execution of reducers


There is no guarantee that Pig applies the default_parallel setting to every step in the script. Try adding PARALLEL to the specific join/group step that you think is taking the time (in your case, the GROUP step).

 inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;

If that still does not work, check your data for skew.


I think the data is skewed. Only a small number of mappers are producing disproportionately large output. Look at the distribution of keys in your data; for example, it may contain a few groups with a very large number of records.
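One way to look at the key distribution is to count records per key and list the heaviest keys. This is only a sketch: it assumes the relation and key are named inputData and group_name, as in the GROUP statement from the other answer, and the cut-off of 20 is arbitrary.

```pig
-- Count records per grouping key to see whether a few keys dominate.
-- inputData and group_name are assumed names, taken from the GROUP example.
byKey     = GROUP inputData BY group_name;
keyCounts = FOREACH byKey GENERATE group AS group_name, COUNT_STAR(inputData) AS n;

-- Show the keys with the most records (20 is an arbitrary cut-off).
ordered   = ORDER keyCounts BY n DESC;
heaviest  = LIMIT ordered 20;
DUMP heaviest;
```

If one or two keys account for most of the records, adding reducers will not help much, because all records with the same key go to the same reducer.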


I tried "SET default_parallel" and "PARALLEL 100", but no luck. Pig still uses 1 reducer.
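For reference, the two settings are usually written roughly as below. The relation and key names are borrowed from the other answer and are assumptions, not the script actually used here.

```pig
-- Script-wide default for the number of reducers.
SET default_parallel 100;

-- Per-operator override on a reduce-side step.
-- inputData and group_name are assumed names.
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 100;
```

Either setting only controls how many reducers are available. Each key is still handled by exactly one reducer, so if the grouping produces a single key, only one reducer does real work.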

It turned out I had to generate a random integer from 0 to 99 for each record and group the records by that number.

The extra grouping step costs some time, but overall it is much faster for me because now I can use more reducers.

Here is the code (SUBMITTER is my own UDF):

 tmpRecord = FOREACH record GENERATE (int)(RANDOM()*100.0) as rnd, data;
 groupTmpRecord = GROUP tmpRecord BY rnd;
 result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));