
Reading a JSON file using Apache Spark


You will need to reformat the JSON so that each object sits on a single line. Your file is one multi-line JSON object, which is why it is not being read and loaded properly (Spark expects one object per row).

Quoting the JSON API docs:

Loads a JSON file (one object per line) and returns the result as a DataFrame.

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}

I just tried it in the shell; it should work the same way from code (I got the same corrupted-record error when I read a multi-line JSON).

scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]

Edit:

You can get the values out of that DataFrame using any action, for example:

scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
|           GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+

You should be able to do it from your code as well
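For reference, here is a minimal sketch of the same steps as a standalone Scala application (the local master and the path are just placeholders):

import org.apache.spark.sql.SparkSession

object JsonReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("jsonreader")
      .getOrCreate()

    // Same load and select as in the shell session above
    val df = spark.read.json("C:/DevelopmentTools/data.json")
    df.select("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm").show()

    spark.stop()
  }
}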


Just make sure your JSON is on one line. You are reading nested JSON, so if you have already done this, you have loaded the JSON successfully; you are just displaying it the wrong way. Because it is nested, you cannot show it directly. For example, if you want the title data of GlossDiv, you can show it as follows:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
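With the sample JSON above, this should print something like:

+-----+
|title|
+-----+
|    S|
+-----+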


Try reading the whole file as a single record instead, so the JSON does not have to sit on one line. wholeTextFiles returns (path, content) pairs, so keep only the values:

import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
session.read().json(jsc.wholeTextFiles("...").values()).show();
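The same idea from the Scala shell would look like this (a sketch, assuming a Spark 2.x DataFrameReader, where json() still accepts an RDD[String]):

val df = spark.read.json(spark.sparkContext.wholeTextFiles("...").values)

Keep in mind that wholeTextFiles loads each file as a single in-memory record, so this approach is best kept to reasonably small files.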