pig join gets OutOfMemoryError in reducer when mapred.job.shuffle.input.buffer.percent=0.70


Generally speaking, mapred.job.shuffle.input.buffer.percent=0.70 will not trigger an OutOfMemoryError on its own, because this setting ensures that at most 70% of the reducer's heap is used to store shuffled data. However, in my experience there are two scenarios that can still cause an OutOfMemoryError.
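(Aside: to put the 70% cap in absolute terms, here is a rough back-of-the-envelope sketch; the heap size is an assumed example value, not taken from your job.)

```java
// Rough sketch of the reduce-side shuffle budget. The -Xmx figure is an
// assumption for illustration, not taken from the question.
public class ShuffleBudget {
    public static void main(String[] args) {
        long reducerHeapMb = 1024;      // e.g. mapred.child.java.opts=-Xmx1024m (assumed)
        double seventyPercent = 0.70;   // mapred.job.shuffle.input.buffer.percent=0.70
        double thirtyPercent = 0.30;    // the lower value you compared against

        System.out.printf("0.70 -> about %.0f MB of a %d MB heap may hold shuffled map outputs%n",
                reducerHeapMb * seventyPercent, reducerHeapMb);
        System.out.printf("0.30 -> about %.0f MB of a %d MB heap may hold shuffled map outputs%n",
                reducerHeapMb * thirtyPercent, reducerHeapMb);
    }
}
```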

1) Your program has a combine() function and that combine() is memory-consuming, so total memory usage may exceed 70% of the heap during the shuffle phase and trigger an OutOfMemoryError (see the combiner sketch below, after point 2). In general, though, Pig does not use a combine() in its Join operator.

2) The JVM manages its own memory and divides the heap into Eden, the S0 and S1 survivor spaces, and the old generation; S0 and S1 are used during garbage collection. In some cases S0 + S1 + the buffered shuffle data (70% of the heap) add up to more than the heap size, and an OutOfMemoryError occurs.
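To make scenario 1 concrete, here is a hypothetical, deliberately memory-hungry combiner (plain MapReduce code, not anything Pig generates): everything it buffers comes out of the same child JVM heap that already holds the ~70% shuffle buffer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner that keeps an entire group in memory before emitting.
// Run during the shuffle, it adds its own allocations on top of the in-memory
// map outputs already held by the shuffle buffer.
public class BufferingCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Hadoop reuses the LongWritable it hands out, so each value must be copied
        // to be kept; buffering the whole group is what makes this memory-consuming.
        List<Long> buffered = new ArrayList<>();
        for (LongWritable value : values) {
            buffered.add(value.get());
        }
        long sum = 0L;
        for (long v : buffered) {
            sum += v;
        }
        context.write(key, new LongWritable(sum));
    }
}
```

A well-behaved combiner would keep only the running sum instead of the whole list.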

As you mentioned, with mapred.job.shuffle.input.buffer.percent=0.30 only 30% of the heap is used to store shuffled data, so the heap is much harder to fill. I would need the job's detailed configuration (such as -Xmx), the data size, and the logs to give you a more specific answer.
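A minimal sketch of how these knobs could be set when submitting the job, assuming the old (MR1-era) property names from your question; the -Xmx value and GC flags are only examples. The same property should also be settable from a Pig script with the set command.

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    // Returns a Configuration with the reducer heap and shuffle buffer fraction set.
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Child JVM heap for the tasks (example value); the GC flags make the
        // Eden/S0/S1/old sizing from scenario 2 visible in the task logs.
        conf.set("mapred.child.java.opts", "-Xmx1024m -verbose:gc -XX:+PrintGCDetails");
        // Fraction of that heap the reduce-side shuffle may use for in-memory map outputs.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.30f);
        return conf;
    }
}
```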

Speaking of the SpillableMemoryManager: the default collection data structure in Pig is a “bag”. Bags are spillable, meaning that if there is not enough memory to hold all the tuples of a bag in RAM, Pig will spill part of the bag to disk. This allows a large job to make progress, albeit slowly, rather than crash with an “out of memory” error. (This paragraph is from Pig's blog.)
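As an illustration of that spillable behaviour, here is a small sketch using Pig's Java API (exact spill thresholds and registration details vary between Pig versions, so treat it as illustrative only):

```java
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BagSpillSketch {
    public static void main(String[] args) throws Exception {
        TupleFactory tupleFactory = TupleFactory.getInstance();
        // Bags obtained from BagFactory can be tracked by the SpillableMemoryManager,
        // which asks them to spill part of their tuples to disk when the heap runs low.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        for (long i = 0; i < 10_000_000L; i++) {   // large enough to matter on a small heap
            Tuple t = tupleFactory.newTuple(1);
            t.set(0, i);
            bag.add(t);
        }
        System.out.println("tuples in bag: " + bag.size());
    }
}
```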

However, the shuffle phase is controlled by Hadoop itself, so the SpillableMemoryManager does not take effect there (strictly speaking, it can take effect in a combine(), which is used by Group By, but Join has no combine()). The SpillableMemoryManager is normally used inside the map(), combine(), and reduce() functions. That is why the SpillableMemoryManager does not protect us from failing when the shuffle input buffer is at 70%. Note that Hadoop does not hold all of the shuffled data in memory; it merges part of the shuffled data onto disk when it grows too large (see the sketch of the relevant knobs below).
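For completeness, that reduce-side merge behaviour is governed by Hadoop's own properties rather than by Pig. A sketch of the old (MR1-era) knobs involved; the defaults in the comments are quoted from memory, so double-check them against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleKnobs {
    public static void print(Configuration conf) {
        // Fraction of reducer heap used to hold map outputs during the shuffle (default ~0.70).
        System.out.println(conf.get("mapred.job.shuffle.input.buffer.percent", "0.70"));
        // When in-memory map outputs fill this fraction of that buffer, they are merged to disk (default ~0.66).
        System.out.println(conf.get("mapred.job.shuffle.merge.percent", "0.66"));
        // ...or when this many in-memory map output segments have accumulated (default 1000).
        System.out.println(conf.get("mapred.inmem.merge.threshold", "1000"));
        // Fraction of heap allowed to retain map outputs while reduce() runs (default 0.0 = spill everything).
        System.out.println(conf.get("mapred.job.reduce.input.buffer.percent", "0.0"));
    }
}
```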