How can I efficiently read multiple json files into a Dataframe or JavaRDD?

How can I efficiently read multiple json files into a Dataframe or JavaRDD?


You can use exactly the same code to read multiple JSON files: pass a path to a directory, or a path with wildcards, instead of the path to a single file.

DataFrameReader also provides a json method with the following signature:

json(jsonRDD: JavaRDD[String])

which can be used to parse JSON that has already been loaded into a JavaRDD.


To read multiple inputs in Spark, use wildcards. That holds whether you're constructing a DataFrame or an RDD.

context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")


The spark.read.json function also accepts a list of files as a parameter.

spark.read.json(list_of_json_files)

This reads every file in the list and returns a single DataFrame containing the data from all of them.
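
If you are on the Java API and need to build that list yourself, here is a minimal sketch using only the JDK. JsonPathCollector and collectJsonPaths are hypothetical names for illustration, not part of Spark. The Spark call itself is left as a comment and assumes an existing SparkSession named spark on Spark 2.x or later, where DataFrameReader.json can take several paths.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Spark): collects JSON file paths from a directory.
class JsonPathCollector {

    // Returns the paths of all *.json files directly inside dir, sorted by name.
    // Subdirectories are not searched.
    static String[] collectJsonPaths(Path dir) throws IOException {
        List<String> paths = new ArrayList<>();
        // The glob filter keeps only entries whose file name ends in .json.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.json")) {
            for (Path p : stream) {
                paths.add(p.toString());
            }
        }
        Collections.sort(paths);
        return paths.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String[] paths = collectJsonPaths(dir);

        // Assumption: `spark` is an existing SparkSession (Spark 2.x or later).
        // The whole array could then be passed in one call:
        // Dataset<Row> df = spark.read().json(paths);

        for (String p : paths) {
            System.out.println(p);
        }
    }
}
```

This only sees files on the local filesystem. For HDFS or S3 inputs, a wildcard path as in the answer above is likely the simpler choice.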