
Handling record size more than 3GB in Spark


You probably have one huge line in your file containing the array. You get the exception because you are trying to build a CharBuffer that is too big (most likely its length overflows into a negative integer). The maximum array/string size in Java is 2^31 - 1 (Integer.MAX_VALUE) (see this thread). You say you have a 3GB record; at 1 byte per char, that makes about 3 billion characters, which is more than 2^31, i.e. roughly 2 billion.
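To make the arithmetic concrete, here is a quick check you can run in the Scala REPL (a rough sketch; the 3GB figure comes from your question and the 1 byte per character assumption from above):

val maxStringLength = Int.MaxValue.toLong        // 2^31 - 1 = 2147483647, the JVM cap on array/String length
val recordChars     = 3L * 1024 * 1024 * 1024    // ~3.2 billion characters for a 3GB record at 1 byte per char
recordChars > maxStringLength                    // true: the record cannot fit in a single String or CharBuffer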

What you could do is a bit hacky, but since you only have one key with a big array, it may work. Your JSON file might look like:

{  "key" : ["v0", "v1", "v2"... ]}

or like this, but I think in your case it is the former:

{  "key" : [      "v0",       "v1",       "v2",      ...    ]}

Thus, you could try changing the record delimiter used by Hadoop to ",", as here. Basically, they do it like this:

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlFile(path: String) = {
    // use "," instead of "\n" as the record delimiter
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", ",")
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)   // keep only the record text, drop the byte-offset key
}
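
For illustration, reading the one-line file from above with this function would give you records roughly like the following (the path is just a placeholder, and sc is assumed to be an existing SparkContext):

// Hypothetical usage: splitting the single-line JSON on "," yields records like
//   {  "key" : ["v0"
//    "v1"
//    "v2"
//    ... ]}
// i.e. the brackets are still attached to the first and last records.
nlFile("/path/to/big.json").take(4).foreach(println)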

Then you could read your array; you would just have to remove the JSON brackets yourself with something like this:

nlFile("...")  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$",""))

Note that you would have to be more careful if your records can contain the characters "[" or "]", but that is the general idea.
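
Putting it together, a minimal end-to-end sketch might look like this (assumptions on top of the above: the file has the single-key layout shown earlier, the values are plain strings that never contain "[" or "]", sc is an existing SparkContext, the path is a placeholder, and the final quote-stripping step is my addition, not part of the original suggestion):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", ",")   // split records on "," instead of newlines

val values = sc
  .newAPIHadoopFile("/path/to/big.json", classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)
  // drop everything up to the opening "[" (first record) and from the closing "]" onwards (last record)
  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$", ""))
  // strip surrounding whitespace and the JSON string quotes to recover the raw values
  .map(_.trim.stripPrefix("\"").stripSuffix("\""))

values.take(5).foreach(println)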