Updating a DataFrame column in Spark (Python)


While you cannot modify a column in place, you can operate on a column and return a new DataFrame reflecting that change. To do so, first create a UserDefinedFunction implementing the operation to apply, then selectively apply that function to the targeted column only. In Python:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

name = 'target_column'
udf = UserDefinedFunction(lambda x: 'new_value', StringType())
new_df = old_df.select(
    *[udf(column).alias(name) if column == name else column
      for column in old_df.columns]
)

new_df now has the same schema as old_df (assuming old_df.target_column was also of type StringType), but every value in column target_column will be new_value.
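The select expression above builds its column list with a comprehension. That selection logic is plain Python, so it can be illustrated without a running Spark session; the column names and the `build_select_list` helper below are hypothetical, used only to show the pattern:

```python
# Sketch of the column-list comprehension used above, with the UDF call
# replaced by a plain string-producing function so it runs without Spark.
# Column names here are hypothetical.
def build_select_list(columns, name, transform):
    # Apply `transform` only to the targeted column; pass the rest through.
    return [transform(c) if c == name else c for c in columns]

cols = build_select_list(['id', 'target_column', 'ts'],
                         'target_column',
                         lambda c: f"udf({c}) AS {c}")
# cols -> ['id', 'udf(target_column) AS target_column', 'ts']
```

In the real query, `transform` is the wrapped UDF call with `.alias(name)`, which is why the untouched columns keep their names and types while only the targeted column is rewritten.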


Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in PySpark without UDFs:

# update df[update_col], mapping old_value --> new_value
from pyspark.sql import functions as F

df = df.withColumn(update_col,
    F.when(df[update_col] == old_value, new_value)
     .otherwise(df[update_col]))


DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you will need to create a new DataFrame by transforming the original one, either using the SQL-like DSL or RDD operations such as map.
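The RDD map route mentioned above replaces a value row by row. The row-transforming function is plain Python, so it can be sketched and checked on its own; the `replace_value` helper and the column index below are hypothetical, and the Spark wiring is shown only in comments:

```python
# Sketch of a row transformation for use with rdd.map, assuming rows are
# treated as tuples. The function and its arguments are hypothetical.
def replace_value(row, col_index, old_value, new_value):
    # Return a new tuple with old_value replaced by new_value in one column;
    # all other columns are passed through unchanged.
    values = list(row)
    if values[col_index] == old_value:
        values[col_index] = new_value
    return tuple(values)

# With a SparkSession available (hypothetical name `spark`), this would be
# applied as:
#   rdd = df.rdd.map(lambda row: replace_value(tuple(row), 1, 'old', 'new'))
#   new_df = spark.createDataFrame(rdd, df.schema)
```

Note that round-tripping through the RDD API is usually slower than the DSL-based approaches above, since it deserializes every row into Python objects; prefer `when`/`otherwise` or a UDF when they suffice.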

A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.