Column alias after groupBy in pyspark


You can use agg instead of calling the max method directly, because agg takes Column expressions that can be aliased:

from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala

import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))


This happens because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:

import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))


In addition to the answers already here, the following two approaches are convenient if you know the name of the aggregated column, and neither requires importing from pyspark.sql.functions:

1

# Backticks make Spark read max(diff) as the existing column name, not as a new call to max()
grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .selectExpr('`max(diff)` AS maxDiff')

See docs for info on .selectExpr()

2

grouped_df = joined_df.groupBy(temp1.datestamp) \
                      .max('diff') \
                      .withColumnRenamed('max(diff)', 'maxDiff')

See docs for info on .withColumnRenamed()

This answer here goes into more detail: https://stackoverflow.com/a/34077809