AttributeError: 'DataFrame' object has no attribute 'map'


You can't map a DataFrame directly, but you can convert it to an RDD and map that with spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map was an alias for spark_df.rdd.map(). As of Spark 2.0, you must explicitly call .rdd first.


You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, and no SQL plan, and PySpark transformations become less efficient.

What should you do instead?

Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.

Another example is using explode instead of flatMap (which is available on RDDs). In Scala, with org.apache.spark.sql.functions.explode and spark.implicits._ imported:

df.select($"name", explode($"knownLanguages"))
  .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+

You can also use withColumn or a UDF, depending on the use case, or another option in the DataFrame API.