How to change a dataframe column from String type to Double type in PySpark?
There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:
from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))
or a short type-name string:
changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))
where the canonical string names (other variations may be supported as well) correspond to each type's simpleString value. So for atomic types:
from pyspark.sql import types

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp
and, for example, for complex types:
types.ArrayType(types.IntegerType()).simpleString()
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'
To preserve the column name and avoid adding an extra column, pass the input column's own name to withColumn:
from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))
The given answers are enough to solve the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the given answers did not cover it.
We can refer to the column in a Spark statement with the col("column_name") function:
from pyspark.sql.functions import col

changedTypedf = joindf.withColumn("show", col("show").cast("double"))