
Handling record size more than 3GB in Spark


You probably have one huge line in your file containing the array. You get the exception because you are trying to build a CharBuffer that is too big (most likely its length overflows into a negative integer). The maximum array/string size in Java is 2^31 - 1 (Integer.MAX_VALUE) (see this thread). You say you have a 3GB record; at 1 byte per char, that makes about 3 billion characters, which is more than 2^31, i.e. roughly 2 billion.
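To make the arithmetic concrete, here is a quick check you can run in the Scala REPL (a rough sketch; the 3GB figure comes from your question and the 1 byte per character assumption from above):

val maxStringLength = Int.MaxValue.toLong        // 2^31 - 1 = 2147483647, the JVM cap on array/String length
val recordChars     = 3L * 1024 * 1024 * 1024    // ~3.2 billion characters for a 3GB record at 1 byte per char
recordChars > maxStringLength                    // true: the record cannot fit in a single String or CharBuffer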

What you could do is a bit hacky, but since you only have one key with a big array, it may work. Your JSON file might look like:

{  "key" : ["v0", "v1", "v2"... ]}

or like this, but I think in your case it is the former:

{  "key" : [      "v0",       "v1",       "v2",      ...    ]}

Thus, you could try changing the record delimiter used by Hadoop to ",", as here. Basically, they do it like this:

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

def nlFile(path: String) = {
    // use "," instead of "\n" as the record delimiter
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", ",")
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)   // keep only the record text, drop the byte-offset key
}
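
For illustration, reading the one-line file from above with this function would give you records roughly like the following (the path is just a placeholder, and sc is assumed to be an existing SparkContext):

// Hypothetical usage: splitting the single-line JSON on "," yields records like
//   {  "key" : ["v0"
//    "v1"
//    "v2"
//    ... ]}
// i.e. the brackets are still attached to the first and last records.
nlFile("/path/to/big.json").take(4).foreach(println)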

Then you could read your array; you would just have to remove the JSON brackets yourself with something like this:

nlFile("...")  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$",""))

Note that you would have to be more careful if your records can contain the characters "[" or "]", but that is the general idea.
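
Putting it together, a minimal end-to-end sketch might look like this (assumptions on top of the above: the file has the single-key layout shown earlier, the values are plain strings that never contain "[" or "]", sc is an existing SparkContext, the path is a placeholder, and the final quote-stripping step is my addition, not part of the original suggestion):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", ",")   // split records on "," instead of newlines

val values = sc
  .newAPIHadoopFile("/path/to/big.json", classOf[TextInputFormat],
                    classOf[LongWritable], classOf[Text], conf)
  .map(_._2.toString)
  // drop everything up to the opening "[" (first record) and from the closing "]" onwards (last record)
  .map(_.replaceAll("^.*\\[", "").replaceAll("\\].*$", ""))
  // strip surrounding whitespace and the JSON string quotes to recover the raw values
  .map(_.trim.stripPrefix("\"").stripSuffix("\""))

values.take(5).foreach(println)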