Re-using A Schema from JSON within a Spark DataFrame using Scala


I recently ran into this. I'm using Spark 2.0.2, so I don't know whether this solution works with earlier versions.

import scala.util.Try
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}

/** Produce a schema JSON string from a Dataset */
def serializeSchema(ds: Dataset[_]): String = ds.schema.json

/** Produce a StructType schema object from a JSON string */
def deserializeSchema(json: String): StructType = {
  Try(DataType.fromJson(json)).getOrElse(LegacyTypeStringParser.parse(json)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing StructType: $json")
  }
}

Note that the deserialize function is copied straight from a private function in Spark's StructType object, so I don't know how well it will be supported across versions.
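For context, here is a minimal round-trip sketch of how the two helpers might be used, assuming an existing SparkSession named spark and a hypothetical input file at /tmp/events.json: persist the schema as a JSON string once, then rebuild it on later reads so Spark can skip schema inference.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-roundtrip").getOrCreate()

// Hypothetical input path; substitute your own data source.
val ds = spark.read.json("/tmp/events.json")

// Serialize the schema to a JSON string, e.g. to persist alongside the data.
val schemaJson = serializeSchema(ds)

// Later: rebuild the StructType and apply it on read, skipping inference.
val schema = deserializeSchema(schemaJson)
val reloaded = spark.read.schema(schema).json("/tmp/events.json")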


Well, the error message should tell you everything you need to know here: StructType expects a sequence of fields as an argument. So in your case the schema should look like this:

import org.apache.spark.sql.types._

StructType(Seq(
  StructField("comments", ArrayType(StructType(Seq(      // <- Seq[StructField]
    StructField("comId", StringType, true),
    StructField("content", StringType, true))), true), true),
  StructField("createHour", StringType, true),
  StructField("gid", StringType, true),
  StructField("replies", ArrayType(StructType(Seq(       // <- Seq[StructField]
    StructField("content", StringType, true),
    StructField("repId", StringType, true))), true), true),
  StructField("revisions", ArrayType(StructType(Seq(     // <- Seq[StructField]
    StructField("modDate", StringType, true),
    StructField("revId", StringType, true))), true), true)))
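As a quick illustration of why the Seq wrapper matters, here is a minimal sketch; the bad variant is my guess at the original, non-compiling code:

import org.apache.spark.sql.types._

// Does not compile: StructType has no constructor taking bare StructField arguments.
// val bad = StructType(StructField("gid", StringType, true))

// Compiles: wrap the fields in a Seq (an Array[StructField] also works).
val good = StructType(Seq(StructField("gid", StringType, true)))

The same wrapping is required at every nesting level, including inside each ArrayType, which is what the inline comments above point out.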