How can I efficiently read multiple JSON files into a DataFrame or JavaRDD?

java json apache-spark

You can use exactly the same code to read multiple JSON files. Just pass a path to a directory, or a path with wildcards, instead of the path to a single file.

DataFrameReader also provides a json method with the following signature:

json(jsonRDD: JavaRDD[String])

which can be used to parse JSON that has already been loaded into a JavaRDD.

To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a DataFrame or an RDD.

context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")

The function spark.read.json accepts a list of file paths as a parameter (a list in PySpark, varargs in Scala and Java).

spark.read.json(list_of_json_files)

This reads all the files in the list and returns a single DataFrame containing the data from all of them.
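If the list of files has to be built at runtime, one way to do it in Java is to collect the paths with the standard library first. The sketch below is only an illustration: the class name JsonPathCollector and the method collectJsonPaths are hypothetical, and it assumes the files sit directly in one local directory. It does not depend on Spark itself; the resulting array is meant to be passed to the varargs overload json(String... paths) of DataFrameReader.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Spark.
final class JsonPathCollector {

    // Returns the paths of all regular *.json files directly inside dir,
    // sorted lexicographically. Subdirectories are not searched.
    static String[] collectJsonPaths(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".json"))
                    .map(Path::toString)
                    .sorted()
                    .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String[] paths = collectJsonPaths(dir);
        for (String path : paths) {
            System.out.println(path);
        }
        // With Spark on the classpath, the array can be handed to the reader:
        // Dataset<Row> df = spark.read().json(paths);
    }
}
```

For files that all match one pattern, passing the wildcard path directly is simpler. An explicit list is mostly useful when the set of files is filtered or computed by other logic.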
