
Spark - How to count number of records by key


You're nearly there! All you need is a countByValue:

val countOfFemalesByCountry = femaleOnly.map(_(13)).countByValue()
// Prints (Australia, 230), (America, 23242), etc.

(In your example, I assume you meant x(10) rather than x._10)

All together:

sc.textFile("/home/cloudera/desktop/file.txt")    .map(_.split(","))    .filter(x => x(10) == "Female")    .map(_(13))    .countByValue()


Have you considered manipulating your RDD using the DataFrames API?

It looks like you're loading a CSV file, which you can do with spark-csv.

Then it's a simple matter (if your CSV has a header row with the obvious column names) of:

import com.databricks.spark.csv._
import sqlContext.implicits._   // for the $"column" syntax

val countryGender = sqlContext.csvFile("/home/cloudera/desktop/file.txt") // already split into columns
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()

countryGender.show()

If you want to go deeper into this kind of manipulation, here's the guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
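If you happen to be on Spark 2.x or later, a similar sketch using the built-in CSV reader (assuming the file has a header row with gender and country columns; the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("female-counts-by-country").getOrCreate()
import spark.implicits._   // for $"column"

spark.read
  .option("header", "true")   // first line contains the column names
  .csv("/home/cloudera/desktop/file.txt")
  .filter($"gender" === "Female")
  .groupBy("country")
  .count()
  .show()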


You can easily create a key; it doesn't have to exist in the file/database. For example:

val countryGender = sc.textFile("/home/cloudera/desktop/file.txt")
  .map(_.split(","))
  .filter(x => x(10) == "Female")
  .map(x => (x(13), x(10)))   // <<<< here you generate a new key: (country, gender)
  .groupByKey()
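Keep in mind that groupByKey only collects the matching rows per key; to get the actual counts per country you would still need a step like this small sketch (or use reduceByKey as shown earlier to avoid shuffling the full groups):

// Turn the grouped rows into per-country counts
val countsByCountry = countryGender.mapValues(_.size)
countsByCountry.collect().foreach(println)   // e.g. (Australia,230), (America,23242)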