Databricks - Pyspark - Handling nested json with a dynamic key


The idea:

  • Step 1: Extract Header and TimeSeries separately (a sketch of the assumed input follows this list).

  • Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice together with the name of the field, and pack them into a struct.

  • Step 3: Merge all these structs into an array column and explode it.

  • Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.

  • Step 5: Cross join with the Header row.
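
For reference, a minimal sketch of the input this assumes: a single JSON record with a Header struct and a TimeSeries object whose keys are dynamic timestamps. The shape is reconstructed from the output shown below, so the exact values are illustrative:

# Assumed input, reconstructed from the output below (values are illustrative)
data = '''{
  "Header": {"Code1": "abc", "Code2": "def", "Code3": "ghi", "Code4": "jkl"},
  "TimeSeries": {
    "2020-11-25T03:00:00+00:00": {"Amount": 10000, "UnitPrice": 1000},
    "2020-11-26T03:00:00+00:00": {"Amount": 10000, "UnitPrice": 1000}
  }
}'''
df = spark.read.json(spark.sparkContext.parallelize([data]))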

import pyspark.sql.functions as F

# Step 1: split the top-level struct columns
header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

# Step 2: one struct per dynamic key, carrying the key name plus Amount/UnitPrice
fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [F.struct(F.lit(name).alias("Timeseries"),
                 F.col(name).getItem("Amount").alias("Amount"),
                 F.col(name).getItem("UnitPrice").alias("UnitPrice")).alias("ts_" + str(idx))
        for idx, name in fieldNames]

# Steps 3-4: collect the structs into an array, explode it, and flatten the result
combined = F.explode(F.array(*cols)).alias("comb")
timeseries = timeseries_df.select(combined).select("comb.Timeseries", "comb.Amount", "comb.UnitPrice")

# Step 5: attach the single header row to every timeseries row
result = header_df.crossJoin(timeseries)
result.show(truncate=False)

Output:

+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries               |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc  |def  |ghi  |jkl  |2020-11-25T03:00:00+00:00|10000 |1000     |
|abc  |def  |ghi  |jkl  |2020-11-26T03:00:00+00:00|10000 |1000     |
+-----+-----+-----+-----+-------------------------+------+---------+
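
Note that the cross join is safe here because header_df contains exactly one row, so each exploded timeseries row is simply paired with that single header. If the source held multiple records, you would need a real join key instead.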