pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'
SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way as you used to use SQLContext:
spark.createDataFrame(...)
and if you ever have to access SparkContext, use the sparkContext attribute:
spark.sparkContext
so if you need SQLContext for backwards compatibility you can:
SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
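Put together, a backwards-compatible SQLContext can be built from an existing session like this (note that SQLContext is kept only for legacy code and is deprecated as of Spark 3.0, so new code should call methods on the session directly; the sample rows are placeholders):

```python
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("example").master("local[1]").getOrCreate()

# Wrap the session's context so legacy SQLContext-based code keeps working.
sql_context = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)

legacy_df = sql_context.createDataFrame([("Alex", 21)], ["name", "age"])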
Whenever you try to create a DataFrame from a backward-compatible object, such as an RDD or a DataFrame created by a SparkSession, you need to make your SQLContext aware of your session and context.
For example, if I create an RDD:
ss = SparkSession.builder.appName("vivek").master('local').config("k1", "vi").getOrCreate()
rdd = ss.sparkContext.parallelize([('Alex', 21), ('Bob', 44)])
but if we wish to create a DataFrame from this RDD, we first need a session-aware SQLContext:
sq = SQLContext(sparkContext=ss.sparkContext, sparkSession=ss)
Only then can we use this SQLContext with the RDD (or with a DataFrame created from pandas).
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
df = sq.createDataFrame(rdd, schema)
df.collect()