How to change dataframe column names in pyspark? How to change dataframe column names in pyspark? python python

How to change dataframe column names in pyspark?


There are many ways to do that:

  • Option 1. Using selectExpr.

     data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],                                    ["Name", "askdaosdka"]) data.show() data.printSchema() # Output #+-------+----------+ #|   Name|askdaosdka| #+-------+----------+ #|Alberto|         2| #| Dakota|         2| #+-------+----------+ #root # |-- Name: string (nullable = true) # |-- askdaosdka: long (nullable = true) df = data.selectExpr("Name as name", "askdaosdka as age") df.show() df.printSchema() # Output #+-------+---+ #|   name|age| #+-------+---+ #|Alberto|  2| #| Dakota|  2| #+-------+---+ #root # |-- name: string (nullable = true) # |-- age: long (nullable = true)
  • Option 2. Using withColumnRenamed, notice that this method allows you to "overwrite" the same column. For Python3, replace xrange with range.

     from functools import reduce oldColumns = data.schema.names newColumns = ["name", "age"] df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), xrange(len(oldColumns)), data) df.printSchema() df.show()
  • Option 3. usingalias, in Scala you can also use as.

     from pyspark.sql.functions import col data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age")) data.show() # Output #+-------+---+ #|   name|age| #+-------+---+ #|Alberto|  2| #| Dakota|  2| #+-------+---+
  • Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

     sqlContext.registerDataFrameAsTable(data, "myTable") df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable") df2.show() # Output #+-------+---+ #|   name|age| #+-------+---+ #|Alberto|  2| #| Dakota|  2| #+-------+---+


df = df.withColumnRenamed("colName", "newColName")\       .withColumnRenamed("colName2", "newColName2")

Advantage of using this way: With long list of columns you would like to change only few column names. This can be very convenient in these scenarios. Very useful when joining tables with duplicate column names.


If you want to change all columns names, try df.toDF(*cols)