Hadoop: job runs okay on smaller set of data but fails with large dataset


Since you have more than one reducer, your mappers will write their output to the local disk on your slaves (as opposed to HDFS). To be more precise, mappers don't write to the local disk immediately: they buffer output in memory until it reaches a threshold (see the io.sort.mb config setting), then flush it to disk. This process is called spilling. I think the problem is that when Hadoop tries to spill to disk, your slaves don't have enough disk space to hold all the data generated by your mappers.
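For reference, a minimal sketch of the two settings involved, using the Hadoop 1.x property names (newer releases rename them to mapreduce.task.io.sort.mb and mapreduce.cluster.local.dir); the path below is a placeholder. Pointing mapred.local.dir at a partition with more free space is another way to relieve the pressure:

<property>
  <name>io.sort.mb</name>
  <!-- size (MB) of the in-memory buffer a map task fills before spilling; 100 is the default -->
  <value>100</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <!-- where spill files are written; placeholder path, point it at your largest partition -->
  <value>/path/to/large/disk/mapred/local</value>
</property>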

You mentioned each mapper produces a JSON string. Assuming ~100KB per doc (perhaps even bigger than this), that amounts to 278,262 x 100KB = ~28GB, while your two slaves have only about 15GB of free space each.

The easiest fix, I think, is to compress the intermediate output from your mappers using the following two config settings:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Since your data is all JSON text, you should benefit from any compression codec Hadoop supports.
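If gzip slows your map tasks down too much, Snappy is a common lower-CPU alternative; a sketch, assuming the native Snappy libraries are installed on your slaves:

<property>
  <name>mapreduce.map.output.compress.codec</name>
  <!-- Snappy trades compression ratio for speed; requires the native library -->
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>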

As an FYI, if your document count grows well beyond 2 million, you should consider adding more memory to your master. As a rule of thumb, each file, directory, and block takes up about 150 bytes of NameNode heap, which works out to roughly 300MB per 1 million files (one file object plus one block object each). In practice, however, I'd reserve 1GB per 1 million files, i.e. about 2GB for 2 million documents.
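If you do raise the heap, a sketch assuming the Hadoop 1.x conf/ layout; note that HADOOP_HEAPSIZE (in MB) sizes every daemon launched from that node, so use HADOOP_NAMENODE_OPTS instead if you only want to grow the NameNode:

# conf/hadoop-env.sh
# placeholder value: ~2GB, per the 1GB-per-million-files rule of thumb above
export HADOOP_HEAPSIZE=2000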


I ran into the same issue (on Mac OS X) and resolved it by setting the following value in mapred-site.xml:

<property>
  <name>mapred.child.ulimit</name>
  <value>unlimited</value>
</property>

I then stopped the Hadoop services with bin/stop-all.sh, removed the /usr/local/tmp/ folder, formatted the namenode with bin/hadoop namenode -format, and started the Hadoop services again with bin/start-all.sh.