Apache Pig: FLATTEN and parallel execution of reducers

hadoop apache-pig

There is no guarantee that Pig applies the default_parallel setting to every step in a script. Try adding a PARALLEL clause to the specific JOIN/GROUP step that you think is taking the time (in your case, the GROUP step).

 inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
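As a rough sketch of how the two settings relate, the script-wide default and the per-statement clause can be combined. The value 20 is only an illustrative assumption; inputData and group_name come from the statement above.

```pig
-- Script-wide default number of reducers for reduce-side operators
-- (GROUP, JOIN, ORDER, DISTINCT, ...). The value 20 is an arbitrary example.
SET default_parallel 20;

-- A PARALLEL clause applies to this statement only and takes
-- precedence over default_parallel for it.
inputDataGrouped = GROUP inputData BY (group_name) PARALLEL 67;
```

PARALLEL only affects the reduce phase; the number of mappers is determined by the input splits.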

If that still does not help, check your data for skew.


I think the data is skewed: only a small number of mappers are producing a disproportionately large share of the output. Look at the distribution of keys in your data. For example, the data may contain a few groups that each hold a very large number of records.
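One way to inspect the key distribution is to count records per key and look at the largest groups. This is a minimal sketch, assuming the relation is called inputData and the key field group_name as in the GROUP statement earlier; the other aliases and the cutoff of 20 are made up for illustration.

```pig
-- Count how many records fall under each key.
byKey     = GROUP inputData BY group_name;
keyCounts = FOREACH byKey GENERATE group AS group_name,
                                   COUNT_STAR(inputData) AS num_records;

-- Show the 20 largest groups; a handful of very large counts indicates skew.
byCount = ORDER keyCounts BY num_records DESC;
top20   = LIMIT byCount 20;
DUMP top20;
```

If only one or a few keys show up at all, every record for a key goes to the same reducer, so raising PARALLEL cannot spread that work any further.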


I tried both "SET default_parallel" and "PARALLEL 100", but no luck: Pig still used 1 reducer.

It turned out that I had to generate a random integer from 0 to 99 for each record and group the records by that random number.

This spends extra time on the grouping itself, but overall it is much faster for me because the work can now be spread across more reducers.

Here is the code (SUBMITTER is my own UDF):

tmpRecord = FOREACH record GENERATE (int)(RANDOM()*100.0) as rnd, data;
groupTmpRecord = GROUP tmpRecord BY rnd;
result = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
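A possible variation, combining this workaround with the PARALLEL suggestion from the earlier answer, is to keep the number of random buckets and the reducer count in a single parameter. This is only a sketch: BUCKETS is a hypothetical parameter name, and whether the explicit PARALLEL is needed depends on how many reducers Pig picks on its own.

```pig
-- Hypothetical parameter: number of random buckets, reused as the reducer count.
%default BUCKETS 100

-- RANDOM() returns a double in [0, 1), so rnd is an integer from 0 to BUCKETS-1.
tmpRecord      = FOREACH record GENERATE (int)(RANDOM() * $BUCKETS) AS rnd, data;

-- One group per bucket, with as many reducers requested as there are buckets.
groupTmpRecord = GROUP tmpRecord BY rnd PARALLEL $BUCKETS;

-- SUBMITTER is the poster's own UDF, applied to each bucket's bag of records.
result         = FOREACH groupTmpRecord GENERATE FLATTEN(SUBMITTER(tmpRecord));
```

Several buckets can still hash to the same reducer, so the load is spread roughly rather than perfectly evenly.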
