Apache Pig: FLATTEN and parallel execution of reducers
There is no guarantee that Pig applies the default_parallel setting to every step in a script. Try adding PARALLEL directly to the specific JOIN/GROUP step that you think is taking the time (in your case, the GROUP step).
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
If it still does not work, check your data for skew.
I think the data is skewed: only a small number of mappers are producing a disproportionately large share of the output. Look at the distribution of keys in your data. For example, the data may contain a few groups with a very large number of records.
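One way to look at the key distribution is to count records per key and list the largest groups. This is only a rough sketch: inputData and group_name are taken from the GROUP statement above, while keyGroups, keyCounts, sortedCounts, topKeys and the cutoff of 20 are made-up names and values for illustration.

```pig
-- Count how many records fall under each grouping key.
-- COUNT_STAR is algebraic, so Pig can use combiners for this step.
keyGroups = GROUP inputData BY group_name;
keyCounts = FOREACH keyGroups GENERATE group AS group_name, COUNT_STAR(inputData) AS n;

-- Show the 20 largest keys (20 is an arbitrary cutoff).
sortedCounts = ORDER keyCounts BY n DESC;
topKeys = LIMIT sortedCounts 20;
DUMP topKeys;
```

If a handful of keys account for most of the records, the reducers that receive those keys will do most of the work no matter how high PARALLEL is set.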
I tried both "SET default_parallel" and "PARALLEL 100", but no luck. Pig still uses 1 reducer.
It turned out that I had to generate a random number from 0 to 99 for each record and group the records by that random number.
The extra grouping costs some time, but overall it is much faster for me because the work can now be spread across more reducers.
Here is the code (SUBMITTER is my own UDF):
tmpRecord = FOREACH record GENERATE (int)(RANDOM() * 100.0) AS rnd, data;
groupTmpRecord = GROUP tmpRecord BY rnd;
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
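A possible variant of the same idea, offered only as a sketch: the bucket count is pulled out into a parameter and PARALLEL is requested explicitly on the GROUP. NUM_BUCKETS is a hypothetical parameter name that is not part of my original script. record, data and SUBMITTER are the same as above, and I am assuming SUBMITTER accepts the bag unchanged.

```pig
-- Hypothetical parameter: number of random buckets (and requested reducers).
%default NUM_BUCKETS 100

-- Assign each record a random bucket in the range 0 .. NUM_BUCKETS - 1.
tmpRecord = FOREACH record GENERATE (int)(RANDOM() * (double)$NUM_BUCKETS) AS rnd, data;

-- Group by bucket and ask for one reducer per bucket.
groupTmpRecord = GROUP tmpRecord BY rnd PARALLEL $NUM_BUCKETS;

-- Run the UDF once per bucket and flatten its output.
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
```

Because the buckets are random, records that belong together are not kept in the same bag, so this only fits when the UDF can process records independently of each other.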