
pyspark error: AttributeError: 'SparkSession' object has no attribute 'parallelize'


SparkSession is not a replacement for a SparkContext but an equivalent of the SQLContext. Just use it the same way you used to use SQLContext:

spark.createDataFrame(...)
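For instance, a minimal sketch (the app name and sample data here are made up for illustration):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point since Spark 2.0
spark = SparkSession.builder.appName("example").getOrCreate()

# createDataFrame is available directly on the session, no SQLContext required
df = spark.createDataFrame([('Alex', 21), ('Bob', 44)], ['name', 'age'])
df.show()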

and if you ever have to access the SparkContext, use the sparkContext attribute:

spark.sparkContext
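That sparkContext attribute is exactly what resolves the error in the question title: parallelize lives on the SparkContext, not on the session. A minimal sketch (sample data made up for illustration):

# parallelize is a SparkContext method, so go through spark.sparkContext
rdd = spark.sparkContext.parallelize([('Alex', 21), ('Bob', 44)])
rdd.collect()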

and if you need an SQLContext for backwards compatibility, you can create one:

from pyspark.sql import SQLContext

SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)


Whenever you try to create a DataFrame from a backward-compatible object like an RDD through an SQLContext, you need to make that SQLContext aware of your session and context.

For example, if I create an RDD:

from pyspark.sql import SparkSession

ss = SparkSession.builder.appName("vivek").master("local").config("k1", "vi").getOrCreate()
rdd = ss.sparkContext.parallelize([('Alex', 21), ('Bob', 44)])

But if we wish to create a DataFrame from this RDD using an SQLContext, we first need to create one:

from pyspark.sql import SQLContext

sq = SQLContext(sparkContext=ss.sparkContext, sparkSession=ss)

Only then can we use this SQLContext with the RDD (or with a DataFrame created by pandas):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = sq.createDataFrame(rdd, schema)
df.collect()
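If everything is wired up correctly, collect() should return something like [Row(name='Alex', age=21), Row(name='Bob', age=44)].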