Retain keys with null values while writing JSON in Spark

Apparently, Spark does not provide any option to keep null fields when writing JSON, so the following custom solution, which serializes each record with Jackson, should work.

import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import com.fasterxml.jackson.databind.ObjectMapper

case class EventHeader(accept_language: String, app_id: String, app_name: String, client_ip_address: String, event_id: String, event_timestamp: String, offering_id: String, server_ip_address: String, server_timestamp: Long, topic_name: String, version: String)

val ds = Seq(EventHeader(null, "App_ID", null, "IP", "ID", null, "Offering", "IP", 1492565987565L, "Topic", "1.0")).toDS()

// Serialize each record with Jackson (which keeps null fields) instead of
// Spark's built-in JSON writer (which drops them).
val ds1 = ds.mapPartitions(records => {
  // One mapper per partition, so it is not shipped inside the closure.
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  records.map(mapper.writeValueAsString(_))
})

ds1.coalesce(1).write.text("hdfs://localhost:9000/user/dedupe_employee")

This will produce the following output:

{"accept_language":null,"app_id":"App_ID","app_name":null,"client_ip_address":"IP","event_id":"ID","event_timestamp":null,"offering_id":"Offering","server_ip_address":"IP","server_timestamp":1492565987565,"topic_name":"Topic","version":"1.0"}


If you are on Spark 3, you can add

spark.sql.jsonGenerator.ignoreNullFields false


Since Spark 3, ignoreNullFields is also available as an option on the DataFrame JSON writer, for writes where you want null fields kept.
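For example, a minimal PySpark sketch (df and the output path here are illustrative placeholders):

# Set globally for the whole session:
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")
df.write.json("hdfs://localhost:9000/user/output_with_nulls")

# Or set per write, via the JSON data source option:
df.write.option("ignoreNullFields", "false").json("hdfs://localhost:9000/user/output_with_nulls")

With either setting, null-valued fields are kept in the generated JSON instead of being dropped.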

If you are stuck on Spark 2 (specifically PySpark 2.4.6), you can try converting the DataFrame to an RDD of Python dicts and then calling RDD.saveAsTextFile to write the output to HDFS. The following example may help.

cols = ddp.columns
ddp_ = ddp.rdd
# Convert each Row to a plain Python dict keyed by column name.
ddp_ = ddp_.map(lambda row: dict([(c, row[c]) for c in cols]))
ddp_.repartition(1).saveAsTextFile(your_hdfs_file_path)

This should produce an output file like:

{"accept_language": None, "app_id":"123", ...}{"accept_language": None, "app_id":"456", ...}

Note that these lines are Python dict reprs, not valid JSON. If you want to replace Python's None with JSON's null, you will need to dump every dict to JSON:

import json

ddp_ = ddp_.map(lambda row: json.dumps(row, ensure_ascii=False))
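Putting the pieces together, a minimal sketch of the whole pipeline (assuming ddp is your DataFrame and your_hdfs_file_path is a placeholder):

import json

cols = ddp.columns
# Convert each Row to a plain dict, then serialize it as a JSON string.
rows = ddp.rdd.map(lambda row: dict([(c, row[c]) for c in cols]))
json_lines = rows.map(lambda row: json.dumps(row, ensure_ascii=False))
json_lines.repartition(1).saveAsTextFile(your_hdfs_file_path)

This should write valid JSON lines such as {"accept_language": null, "app_id": "123", ...}, with null fields retained.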