Databricks - Pyspark - Handling nested json with a dynamic key
The idea:
Step 1: Extract Header and TimeSeries separately.
Step 2: For each field in the TimeSeries object, extract the Amount and UnitPrice, together with the name of the field, and stuff them into a struct.
Step 3: Merge all these structs into an array column, and explode it.
Step 4: Extract Timeseries, Amount and UnitPrice from the exploded column.
Step 5: Cross join with the Header row.
import pyspark.sql.functions as F

# Split Header and TimeSeries into separate DataFrames.
header_df = df.select("Header.*")
timeseries_df = df.select("TimeSeries.*")

# For every dynamic field (timestamp key), build a struct holding the
# field name plus its Amount and UnitPrice.
fieldNames = enumerate(timeseries_df.schema.fieldNames())
cols = [
    F.struct(
        F.lit(name).alias("Timeseries"),
        F.col(name).getItem("Amount").alias("Amount"),
        F.col(name).getItem("UnitPrice").alias("UnitPrice"),
    ).alias("ts_" + str(idx))
    for idx, name in fieldNames
]

# Collect the structs into a single array column and explode it into rows.
combined = F.explode(F.array(*cols)).alias("comb")
timeseries = timeseries_df.select(combined).select(
    "comb.Timeseries", "comb.Amount", "comb.UnitPrice"
)

# Attach the Header row to every time-series row.
result = header_df.crossJoin(timeseries)
result.show(truncate=False)
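Note that the crossJoin only works as intended here because df holds a single document, so header_df has exactly one row; if you load several documents at once, you would want to keep a join key and use a regular join instead.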
Output:
+-----+-----+-----+-----+-------------------------+------+---------+
|Code1|Code2|Code3|Code4|Timeseries               |Amount|UnitPrice|
+-----+-----+-----+-----+-------------------------+------+---------+
|abc  |def  |ghi  |jkl  |2020-11-25T03:00:00+00:00|10000 |1000     |
|abc  |def  |ghi  |jkl  |2020-11-26T03:00:00+00:00|10000 |1000     |
+-----+-----+-----+-----+-------------------------+------+---------+