Hadoop: job runs okay on smaller set of data but fails with large dataset


Since you have more than one reducer, your mappers will write their output to the local disk on your slaves (as opposed to HDFS). To be more precise, mappers don't write to the local disk immediately: they buffer output in memory until it reaches a threshold (see the io.sort.mb config setting), then flush it to disk. This process is called spilling. I think the problem is that when Hadoop tries to spill to disk, your slaves don't have enough disk space to hold all the data generated by your mappers.
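For reference, a minimal sketch of the two settings involved, using the Hadoop 1.x property names (newer releases rename them to mapreduce.task.io.sort.mb and mapreduce.cluster.local.dir); the path below is a placeholder. Pointing mapred.local.dir at a partition with more free space is another way to relieve the pressure:

<property>
  <name>io.sort.mb</name>
  <!-- size (MB) of the in-memory buffer a map task fills before spilling; 100 is the default -->
  <value>100</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <!-- where spill files are written; placeholder path, point it at your largest partition -->
  <value>/path/to/large/disk/mapred/local</value>
</property>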

You mentioned each mapper produces a JSON string. Assuming ~100KB per doc (perhaps even bigger than this), that amounts to 278,262 x 100KB = ~28GB, while your two slaves have only about 15GB of free space each.

The easiest fix, I think, is to compress the intermediate output from your mappers using the following two config settings:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Since your data is all JSON text, you should benefit from any compression codec Hadoop supports.
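If gzip slows your map tasks down too much, Snappy is a common lower-CPU alternative; a sketch, assuming the native Snappy libraries are installed on your slaves:

<property>
  <name>mapreduce.map.output.compress.codec</name>
  <!-- Snappy trades compression ratio for speed; requires the native library -->
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>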

As an FYI, if your document count grows well beyond 2 million, you should consider adding more memory to your master. As a rule of thumb, each file, directory, and block takes up about 150 bytes of NameNode heap, which works out to roughly 300MB per 1 million files (one file object plus one block object each). In practice, however, I'd reserve 1GB per 1 million files, i.e. about 2GB for 2 million documents.
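If you do raise the heap, a sketch assuming the Hadoop 1.x conf/ layout; note that HADOOP_HEAPSIZE (in MB) sizes every daemon launched from that node, so use HADOOP_NAMENODE_OPTS instead if you only want to grow the NameNode:

# conf/hadoop-env.sh
# placeholder value: ~2GB, per the 1GB-per-million-files rule of thumb above
export HADOOP_HEAPSIZE=2000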


I ran into the same issue (on Mac OS X) and resolved it by setting the following value in mapred-site.xml:

<property>
  <name>mapred.child.ulimit</name>
  <value>unlimited</value>
</property>

I then stopped the Hadoop services with bin/stop-all.sh, removed the /usr/local/tmp/ folder, formatted the namenode with bin/hadoop namenode -format, and started the Hadoop services again with bin/start-all.sh.