
aggregate function Count usage with groupBy in Spark


count() can be used inside agg() together with the other aggregations, since the groupBy expression is the same.
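The per-language snippets below use the asker's new_log_df and encodeUDF, which are not defined in this post. As a self-contained illustration (hypothetical column names and values: timePeriod, DOWNSTREAM_SIZE), a minimal Java sketch of count() sitting inside agg() next to other aggregates might look like this:

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CountInsideAgg {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("count-inside-agg")
        .master("local[*]")
        .getOrCreate();

    // Tiny in-memory stand-in for new_log_df (assumed schema and data).
    StructType schema = new StructType()
        .add("timePeriod", DataTypes.StringType)
        .add("DOWNSTREAM_SIZE", DataTypes.LongType);

    List<Row> rows = Arrays.asList(
        RowFactory.create("morning", 100L),
        RowFactory.create("morning", 300L),
        RowFactory.create("evening", 200L));

    Dataset<Row> df = spark.createDataFrame(rows, schema);

    // count() is just another aggregate inside agg(); the grouping
    // expression is the same as it would be for groupBy(...).count().
    df.groupBy("timePeriod")
      .agg(
          mean("DOWNSTREAM_SIZE").alias("Mean"),
          stddev("DOWNSTREAM_SIZE").alias("Stddev"),
          count(lit(1)).alias("Num Of Records"))
      .show(false);

    spark.stop();
  }
}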

With Python

import pyspark.sql.functions as func

new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .groupBy("timePeriod") \
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    ) \
    .show(20, False)

See the PySpark SQL functions documentation.

With Scala

import org.apache.spark.sql.functions._  // for count()

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)

count(lit(1)) counts the literal 1 for every row, so per group it equals the total number of records. Here that is the same result as count("timePeriod"), because the grouping column is never null; count(column) would skip rows where that column is null.
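To make that null-skipping behaviour concrete, here is a small hypothetical Java example (assumed data, including a null DOWNSTREAM_SIZE value): count(lit(1)) counts every row in the group, while count("DOWNSTREAM_SIZE") skips the null row, and count("timePeriod") matches count(lit(1)) only because the grouping column has no nulls.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CountSemantics {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("count-semantics").master("local[*]").getOrCreate();

    StructType schema = new StructType()
        .add("timePeriod", DataTypes.StringType)
        .add("DOWNSTREAM_SIZE", DataTypes.LongType);

    Dataset<Row> df = spark.createDataFrame(Arrays.asList(
        RowFactory.create("morning", 100L),
        RowFactory.create("morning", (Long) null),   // null DOWNSTREAM_SIZE
        RowFactory.create("evening", 200L)), schema);

    df.groupBy("timePeriod")
      .agg(
          count(lit(1)).alias("all_rows"),                 // 2 for "morning"
          count("DOWNSTREAM_SIZE").alias("non_null_rows"), // 1 for "morning" (null skipped)
          count("timePeriod").alias("group_col_rows"))     // 2 for "morning" (never null)
      .show(false);

    spark.stop();
  }
}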

With Java

import static org.apache.spark.sql.functions.*;

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records"))
    .show(20, false);