Reading JSON files into Spark Dataset and adding columns from a separate Map

json scala apache-spark apache-spark-sql apache-spark-dataset

If I understood correctly, you want to associate each key/value pair from the map with the DataFrame read from the corresponding JSON file.

I'll simplify the problem to just 3 files and 3 key/value pairs, all in order.

val kvs = Map("a" -> 1, "b" -> 2, "c" -> 3)
val files = List("data0001.json", "data0002.json", "data0003.json")

Define a case class so the file name, key, and value are easier to handle together:

case class FileWithKV(fileName: String, key: String, value: Int)

Then zip the files with the key/value pairs:

val filesWithKVs = files.zip(kvs)
  .map(p => FileWithKV(p._1, p._2._1, p._2._2))

The result looks like this:

filesWithKVs: List[FileWithKV] = List(FileWithKV(data0001.json,a,1), FileWithKV(data0002.json,b,2), FileWithKV(data0003.json,c,3))
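One caveat: the zip pairs files with map entries purely by position, and a Scala immutable Map only iterates in insertion order while it is very small (up to four entries). Below is a minimal sketch of one way to keep the pairing explicit, assuming the key/value pairs can be supplied as an ordered Seq instead of a Map. The names orderedKvs, orderedFiles and pairedFiles, and the five-file sample, are hypothetical and not part of the original question.

```scala
// Same shape as the case class defined above, repeated so the sketch is self-contained.
final case class FileWithKV(fileName: String, key: String, value: Int)

// A Seq of tuples keeps the order in which it was written,
// unlike a general Map, whose iteration order is not guaranteed.
val orderedKvs: Seq[(String, Int)] =
  Seq("a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 5)

// Hypothetical file names: data0001.json ... data0005.json
val orderedFiles: Seq[String] = (1 to 5).map(i => f"data$i%04d.json")

// Fail fast instead of letting zip silently drop the unmatched tail.
require(orderedFiles.size == orderedKvs.size, "each file needs exactly one key/value pair")

val pairedFiles: Seq[FileWithKV] =
  orderedFiles.zip(orderedKvs).map { case (file, (k, v)) => FileWithKV(file, k, v) }
```

The pattern match in the map step does the same job as p._1, p._2._1, p._2._2, but names each part of the tuple.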

We start with an initial DataFrame built from the head of the collection, then fold left over the rest to build a single DataFrame that holds every file, with the extra columns filled in from each key/value pair.

import org.apache.spark.sql.functions.lit

val head = filesWithKVs.head
val initialDf = spark.read.json(head.fileName)
  .withColumn("new_col_1", lit(head.key))
  .withColumn("new_col_2", lit(head.value))

Now the folding part:

val dfAll = filesWithKVs.tail.foldLeft(initialDf)((df, fileWithKV) => {
  val newDf = spark
    .read.json(fileWithKV.fileName)
    .withColumn("new_col_1", lit(fileWithKV.key))
    .withColumn("new_col_2", lit(fileWithKV.value))
  // union the DataFrames file by file, each one carrying its own key and value
  df.union(newDf)
})

Assuming each of the 3 JSON files contains a column named bar with the value foo, the resulting DataFrame looks like this:

+---+---------+---------+
|bar|new_col_1|new_col_2|
+---+---------+---------+
|foo|        a|        1|
|foo|        b|        2|
|foo|        c|        3|
+---+---------+---------+
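The head-plus-fold steps above can also be written as a map followed by a reduce, which avoids treating the first file specially. This is only a sketch under stated assumptions: the helper names readTagged and readAll are hypothetical, unionByName needs Spark 2.3 or later, and the temporary sample files exist only so the example runs without external data.

```scala
import java.nio.file.{Files, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Same shape as the case class defined above, repeated so the sketch is self-contained.
final case class FileWithKV(fileName: String, key: String, value: Int)

// Read one JSON file and tag every row with the key/value that belongs to it.
def readTagged(spark: SparkSession, f: FileWithKV): DataFrame =
  spark.read.json(f.fileName)
    .withColumn("new_col_1", lit(f.key))
    .withColumn("new_col_2", lit(f.value))

// Combine all tagged files. unionByName matches columns by name rather than by position.
// reduce throws on an empty collection, so fs must contain at least one file.
def readAll(spark: SparkSession, fs: Seq[FileWithKV]): DataFrame =
  fs.map(readTagged(spark, _)).reduce(_ unionByName _)

// Local session for the sketch; in spark-shell the predefined `spark` could be used instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("kv-columns-sketch")
  .getOrCreate()

// Throwaway sample files (an assumption for illustration): one {"bar":"foo"} record each.
val dir: Path = Files.createTempDirectory("kv-json")
val sample: Seq[FileWithKV] =
  Seq("a" -> 1, "b" -> 2, "c" -> 3).zipWithIndex.map { case ((k, v), i) =>
    val path = dir.resolve(f"data${i + 1}%04d.json")
    Files.write(path, """{"bar":"foo"}""".getBytes("UTF-8"))
    FileWithKV(path.toString, k, v)
  }

val combined: DataFrame = readAll(spark, sample)
combined.show()
```

Plain union, as used in the fold, pairs columns by position, so it relies on every file producing the same schema in the same column order. unionByName is a little more forgiving when the inferred column order differs between files.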


I think you should create your own datasource for this. This new datasource would know about your particular folder structure and content structure.
