Updating a dataframe column in spark
While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction
implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:
from pyspark.sql.functions import UserDefinedFunctionfrom pyspark.sql.types import StringTypename = 'target_column'udf = UserDefinedFunction(lambda x: 'new_value', StringType())new_df = old_df.select(*[udf(column).alias(name) if column == name else column for column in old_df.columns])
new_df
now has the same schema as old_df
(assuming that old_df.target_column
was of type StringType
as well) but all values in column target_column
will be new_value
.
Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in pyspark without UDF's:
# update df[update_col], mapping old_value --> new_valuefrom pyspark.sql import functions as Fdf = df.withColumn(update_col, F.when(df[update_col]==old_value,new_value). otherwise(df[update_col])).
DataFrames
are based on RDDs. RDDs are immutable structures and do not allow updating elements on-site. To change values, you will need to create a new DataFrame by transforming the original one either using the SQL-like DSL or RDD operations like map
.
A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science.