
How to pivot on multiple columns in Spark SQL?


Here's a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates).

from pyspark.sql import functions as F

dff = mydf.groupBy('id').pivot('day').agg(F.first('price').alias('price'), F.first('units').alias('unit'))

Here's the result (apologies for the non-matching ordering and naming):

+---+-------+------+-------+------+-------+------+-------+------+
| id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
+---+-------+------+-------+------+-------+------+-------+------+
|100|     23|    10|     45|    11|     67|    12|     78|    13|
|101|     23|    10|     45|    13|     67|    14|     78|    15|
|102|     23|    10|     45|    11|     67|    16|     78|    18|
+---+-------+------+-------+------+-------+------+-------+------+

We just aggregate on both the price and the units columns after pivoting on the day.
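For reference, here's a minimal end-to-end sketch. The input data below is illustrative (made up to match the output shown above), not the question's actual dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative input: one row per (id, day) with a price and a units value.
mydf = spark.createDataFrame(
    [(100, 1, 23, 10), (100, 2, 45, 11), (100, 3, 67, 12), (100, 4, 78, 13),
     (101, 1, 23, 10), (101, 2, 45, 13), (101, 3, 67, 14), (101, 4, 78, 15),
     (102, 1, 23, 10), (102, 2, 45, 11), (102, 3, 67, 16), (102, 4, 78, 18)],
    ['id', 'day', 'price', 'units'],
)

# Single pivot on day, with one aggregation per output column.
dff = mydf.groupBy('id').pivot('day').agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit'),
)
dff.show()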

If the naming needs to match the question exactly:

dff.select([F.col(c).name('_'.join(x for x in c.split('_')[::-1])) for c in dff.columns]).show()

+---+-------+------+-------+------+-------+------+-------+------+
| id|price_1|unit_1|price_2|unit_2|price_3|unit_3|price_4|unit_4|
+---+-------+------+-------+------+-------+------+-------+------+
|100|     23|    10|     45|    11|     67|    12|     78|    13|
|101|     23|    10|     45|    13|     67|    14|     78|    15|
|102|     23|    10|     45|    11|     67|    16|     78|    18|
+---+-------+------+-------+------+-------+------+-------+------+
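Equivalently, a sketch that does the renaming in one toDF call (assuming the same dff as above; single-token names like id pass through unchanged):

# Reverse '1_price' -> 'price_1'; names with no underscore are unaffected.
renamed = dff.toDF(*['_'.join(reversed(c.split('_'))) for c in dff.columns])
renamed.show()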


The solution in the question is the best I could get. The only improvement would be to cache the input dataset to avoid the double scan, i.e.

mydf.cache()
pivot_udf(mydf, 'price', 'units').show()
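The question's pivot_udf isn't reproduced here; as a sketch of what such a helper typically looks like (one pivot per requested column, each joined back on id, which is why caching the input helps):

from pyspark.sql import functions as F

def pivot_udf(df, *cols):
    # Hypothetical helper: pivot each column separately and join the results.
    out = df.select('id').distinct()
    for c in cols:
        pivoted = (df
                   .withColumn('combcol', F.concat(F.lit(c + '_'), df['day'].cast('string')))
                   .groupBy('id')
                   .pivot('combcol')
                   .agg(F.first(c)))
        out = out.join(pivoted, 'id')
    return out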


As of Spark 1.6, I think that's the only way, because pivot takes only one column. It does have a second argument, values, to which you can pass the distinct values of the pivot column; doing so makes your code run faster, because otherwise Spark has to compute those values for you. So yes, that's the right way to do it.
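For example, a sketch of passing the pivot values up front (the values here are illustrative; supplying them skips the extra job Spark would otherwise run to discover the distinct days):

# Explicit pivot values avoid the distinct-value scan over the input.
dff = mydf.groupBy('id').pivot('day', [1, 2, 3, 4]).agg(
    F.first('price').alias('price'),
    F.first('units').alias('unit'),
)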