Aggregate function count() usage with groupBy in Spark
count() can be used inside agg(): the groupBy expression stays the same, and the count is simply added to the list of aggregates.
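For instance, here is a minimal self-contained sketch of the pattern (the toy DataFrame, its values, and the session setup are illustrative assumptions, not part of the original code):

import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data standing in for new_log_df
df = spark.createDataFrame(
    [("morning", 10), ("morning", 20), ("evening", 30)],
    ["timePeriod", "DOWNSTREAM_SIZE"],
)

# count() sits inside agg() alongside the other aggregates;
# the groupBy expression is unchanged
df.groupBy("timePeriod").agg(
    func.count(func.lit(1)).alias("Num Of Records"),
    func.mean("DOWNSTREAM_SIZE").alias("Mean"),
).show()

# If the count is the only aggregate needed, groupBy().count() is a shorthand
df.groupBy("timePeriod").count().show()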
With Python
import pyspark.sql.functions as func

(new_log_df.cache()
    .withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"]))
    .groupBy("timePeriod")
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    )
    .show(20, False))
With Scala
import org.apache.spark.sql.functions._  // for count()

new_log_df.cache()
  .withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
count(lit(1)) counts the literal 1 once per record, i.e. the total number of rows in each group; as long as timePeriod contains no nulls, this is equal to count("timePeriod").
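One caveat: count(someColumn) skips null values, while count(lit(1)) counts every row, so the two can diverge on data with nulls. A quick sketch with hypothetical toy data:

import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One null value to make the difference visible
df = spark.createDataFrame([("a",), ("b",), (None,)], ["timePeriod"])

df.agg(
    func.count(func.lit(1)).alias("count_lit_1"),    # 3: counts every row
    func.count("timePeriod").alias("count_column"),  # 2: nulls are skipped
).show()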
With Java
import static org.apache.spark.sql.functions.*;

new_log_df.cache()
    .withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);
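For completeness, the same aggregation can also be expressed in Spark SQL after registering the DataFrame as a temp view. A sketch in Python, assuming the same new_log_df and encodeUDF as above (the view name "log" is arbitrary; STDDEV in Spark SQL is the sample standard deviation, matching stddev above):

new_log_df.withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .createOrReplaceTempView("log")

spark.sql("""
    SELECT timePeriod,
           AVG(DOWNSTREAM_SIZE)    AS Mean,
           STDDEV(DOWNSTREAM_SIZE) AS Stddev,
           COUNT(1)                AS `Num Of Records`
    FROM log
    GROUP BY timePeriod
""").show(20, False)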