'PipelinedRDD' object has no attribute 'toDF' in PySpark
The toDF method is a monkey patch applied inside the SparkSession constructor (the SQLContext constructor in Spark 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize([("a", 1)])

hasattr(rdd, "toDF")
## False

spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True

rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+
Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
Make sure you have a SparkSession as well:

sc = SparkContext("local", "first app")
spark = SparkSession(sc)