Apache Spark: Get number of records per partition

scala apache-spark hadoop apache-spark-sql partitioning

I'd use the built-in spark_partition_id function. It should be about as efficient as it gets:

    import org.apache.spark.sql.functions.spark_partition_id

    // One output row per non-empty partition: (partition id, record count)
    df.groupBy(spark_partition_id()).count().show()
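
For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch. The object name PartitionCounts, the helper countsPerPartition, the local master setting and the toy data are my own assumptions for illustration, not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.spark_partition_id

object PartitionCounts {
  // Hypothetical helper: maps each non-empty partition id to its row count.
  def countsPerPartition(df: DataFrame): Map[Int, Long] =
    df.groupBy(spark_partition_id().alias("partition_id"))
      .count()
      .collect()
      .map(row => row.getInt(0) -> row.getLong(1))
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-counts-sketch")
      .getOrCreate()

    // Toy data: 100 rows redistributed over 4 partitions.
    val df = spark.range(0, 100).toDF("id").repartition(4)

    // Print the counts ordered by partition id.
    countsPerPartition(df).toSeq.sortBy(_._1).foreach {
      case (partitionId, n) => println(s"partition $partitionId: $n records")
    }

    spark.stop()
  }
}
```

One thing to keep in mind: because this groups over the rows themselves, a partition that holds no rows does not show up in the result at all.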

You can get the number of records per partition like this:

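    import spark.implicits._  // needed for .toDF on an RDD

    df
      .rdd
      // Emit one (partition index, row count) pair per partition
      .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
      .toDF("partition_number", "number_of_records")
      .show()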

Note that this launches a Spark job of its own, because Spark has to read the data in order to count the records.

Spark may also be able to read Hive table statistics, but I don't know how to display that metadata.
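
For tables registered in the catalog, one way to display such statistics is sketched below, using the Spark SQL statements ANALYZE TABLE and DESCRIBE EXTENDED. The table name events_demo, the object TableStatsSketch, the helper tableStatistics and the temporary warehouse directory are hypothetical and only there to make the sketch runnable. Note that these statistics describe the stored table, not the in-memory partitions of a DataFrame, so they answer a related but different question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object TableStatsSketch {
  // Hypothetical helper: returns the value of the "Statistics" row that
  // DESCRIBE EXTENDED reports once statistics have been computed.
  def tableStatistics(spark: SparkSession, table: String): Option[String] =
    spark.sql(s"DESCRIBE EXTENDED $table")
      .collect()
      .find(row => row.getString(0) == "Statistics")
      .map(_.getString(1))

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway warehouse directory so the sketch can be rerun.
    val warehouse = Files.createTempDirectory("warehouse").toString
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("table-stats-sketch")
      .config("spark.sql.warehouse.dir", warehouse)
      .getOrCreate()

    // Hypothetical demo table with 1000 rows.
    spark.range(0, 1000).write.mode("overwrite").saveAsTable("events_demo")

    // Compute and store size and row-count statistics (this scans the table).
    spark.sql("ANALYZE TABLE events_demo COMPUTE STATISTICS")

    // Print the stored statistics, if any were recorded.
    println(tableStatistics(spark, "events_demo").getOrElse("no statistics recorded"))

    spark.stop()
  }
}
```

For a partitioned metastore table, ANALYZE TABLE also accepts a PARTITION clause, and DESCRIBE EXTENDED can be pointed at a single partition. I have not verified that every catalog setup supports this, so treat it as something to check against your Spark version.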

For future PySpark users:

    from pyspark.sql.functions import spark_partition_id

    rawDf.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()

CodeHunter