Add column sum as new column in PySpark dataframe
This was not obvious. I see no row-based sum of the columns defined in the Spark DataFrames API.
Version 2
This can be done in a fairly simple way:
newdf = df.withColumn('total', sum(df[col] for col in df.columns))
df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark DataFrame. For a different sum, you can supply any other list of column names instead.
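For example, a minimal sketch that sums just a subset of the columns; the names 'a' and 'b' here are illustrative, matching the test DataFrame used further down:

# only the listed columns contribute to the new total
subset = ['a', 'b']
newdf = df.withColumn('partial_total', sum(df[col] for col in subset))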
I did not try this as my first solution because I wasn't certain how it would behave. But it works.
Version 1
This is overly complicated, but works as well.
You can do this:
- use df.columns to get a list of the names of the columns
- use that names list to make a list of the columns
- pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner
With Python's reduce, some knowledge of how operator overloading works, and the pyspark code for columns, that becomes:
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

def column_add(a, b):
    return a.__add__(b)

newdf = df.withColumn('total_col', reduce(column_add, (df[col] for col in df.columns)))
Note this is a Python reduce, not a Spark RDD reduce; the second argument to reduce requires the parentheses because it is a generator expression.
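As an aside, the hand-written column_add can be replaced by the standard library's operator.add, which performs the same overloaded + on Column objects. A minimal sketch, equivalent to the code above but not the original author's version:

from functools import reduce
import operator

# operator.add(a, b) evaluates a + b, which dispatches to Column.__add__
newdf = df.withColumn('total_col', reduce(operator.add, (df[col] for col in df.columns)))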
Tested, Works!
$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b': 2, 'c': 3}, {'a': 8, 'b': 5, 'c': 6}, {'a': 3, 'b': 1, 'c': 0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a, b):
...     return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, (df[col] for col in df.columns))).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]
The most straightforward way of doing it is to use the expr function:
# import expr directly rather than *, which would shadow Python builtins like sum
from pyspark.sql.functions import expr

data = data.withColumn('total', expr("col1 + col2 + col3 + col4"))
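If you don't want to hard-code the column names, a sketch that builds the expression string from df.columns, assuming all columns are numeric and have plain, unquoted names:

from pyspark.sql.functions import expr

# with columns a, b, c this produces the string "a + b + c"
total_expr = " + ".join(df.columns)
newdf = df.withColumn('total', expr(total_expr))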
The solution

newdf = df.withColumn('total', sum(df[col] for col in df.columns))

posted by @Paul works. Nevertheless I was getting the error, as many others have:

TypeError: 'Column' object is not callable
After some time I found the problem (at least in my case). The problem is that I had previously imported some pyspark functions with the line

from pyspark.sql.functions import udf, col, count, sum, when, avg, mean, min

so that line imported the pyspark sum function, while

df.withColumn('total', sum(df[col] for col in df.columns))

is supposed to use the normal Python built-in sum function.
You can delete the reference to the pyspark function with del sum.

Otherwise, in my case I changed the import to

import pyspark.sql.functions as F

and then referenced the functions as F.sum.
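To make the distinction concrete, a minimal sketch after the namespaced import; the column name 'a' is illustrative:

import pyspark.sql.functions as F

# the Python built-in sum folds Column objects with +, giving a row-wise total
newdf = df.withColumn('total', sum(df[col] for col in df.columns))

# PySpark's aggregate sum is now unambiguously F.sum, giving a column-wise total
df.agg(F.sum('a')).show()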