Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?


SparkContext is the Scala entry point to Spark, and JavaSparkContext is a Java wrapper around SparkContext.

SQLContext is the entry point of Spark SQL and can be obtained from a SparkContext. Prior to 2.x, RDD, DataFrame, and Dataset were three different data abstractions. Since Spark 2.x, the three data abstractions are unified, and SparkSession is the unified entry point of Spark.

An additional note: RDDs are meant for unstructured, strongly typed data, while DataFrames are for structured, loosely typed data.
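To make that distinction concrete, here is a minimal Java sketch contrasting a strongly typed JavaRDD with a loosely typed DataFrame (the Person bean, app name, and sample values are purely illustrative):

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedVsUntyped {
  // Simple bean used only for illustration
  public static class Person implements Serializable {
    private String name;
    private int age;
    public Person() {}
    public Person(String name, int age) { this.name = name; this.age = age; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("typed-vs-untyped").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // Strongly typed: an RDD of Person objects, no schema attached
    JavaRDD<Person> peopleRdd = jsc.parallelize(
        Arrays.asList(new Person("Ann", 30), new Person("Bob", 25)));

    // Loosely typed: a DataFrame (Dataset<Row>) with a schema inferred from the bean at runtime
    Dataset<Row> peopleDf = spark.createDataFrame(peopleRdd, Person.class);
    peopleDf.filter("age > 26").show();

    spark.stop();
  }
}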

Is there any method to convert or create a Context using SparkSession?

Yes: sparkSession.sparkContext(), and for SQL, sparkSession.sqlContext().
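A minimal Java sketch of this (the master and app name are placeholders):

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class ContextsFromSession {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]")           // placeholder master
        .appName("contexts-demo")     // placeholder app name
        .getOrCreate();

    // The Scala SparkContext that backs the session
    SparkContext sc = spark.sparkContext();

    // Java-friendly wrapper around that same SparkContext
    JavaSparkContext jsc = new JavaSparkContext(sc);

    // Legacy SQLContext, kept around for backward compatibility
    SQLContext sqlContext = spark.sqlContext();

    spark.stop();
  }
}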

Can I completely replace all the Contexts using one single entry point, SparkSession?

Yes. You can get the respective contexts from SparkSession.

Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. added to SparkSession?

Not directly. You have to get the respective context and use it, somewhat like backward compatibility.

How to use such functions in SparkSession?

Get the respective context and use it.
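For example, SparkSession itself has no accumulator or broadcast methods, so you reach through the underlying context. A sketch in Java (names and sample data are illustrative):

import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class ViaUnderlyingContext {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("context-functions-demo").getOrCreate();

    // accumulator() and broadcast() live on the contexts, not on SparkSession
    LongAccumulator aboveThreshold = spark.sparkContext().longAccumulator("aboveThreshold");
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    Broadcast<Integer> threshold = jsc.broadcast(10);

    jsc.parallelize(Arrays.asList(1, 5, 20, 30))
       .filter(x -> x > threshold.value())
       .foreach(x -> aboveThreshold.add(1));

    System.out.println("values above threshold: " + aboveThreshold.value());
    spark.stop();
  }
}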

How to create the following using SparkSession? (A Java sketch follows this list.)

  1. RDD: can be created from sparkSession.sparkContext.parallelize(???)
  2. JavaRDD: the same applies, but via the Java implementation (JavaSparkContext)
  3. JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...), where mapping your data to key-value pairs is one way
  4. Dataset: what SparkSession returns is a Dataset when the data is structured.
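A minimal Java sketch of all four (app name and sample values are placeholders):

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class CreateFromSession {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("create-demo").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // 2. JavaRDD via the JavaSparkContext wrapper
    JavaRDD<String> javaRdd = jsc.parallelize(Arrays.asList("a", "b", "c"));

    // 1. The underlying Scala RDD is one call away
    RDD<String> rdd = javaRdd.rdd();

    // 3. JavaPairRDD by mapping each element to a key-value pair
    JavaPairRDD<String, Integer> pairRdd = javaRdd.mapToPair(s -> new Tuple2<>(s, 1));

    // 4. Dataset directly from the SparkSession
    Dataset<String> ds = spark.createDataset(Arrays.asList("a", "b", "c"), Encoders.STRING());
    ds.show();

    spark.stop();
  }
}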


Explanation from the Spark source code under branch-2.1

SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

JavaSparkContext: A Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.

SQLContext: The entry point for working with structured data (rows and columns) in Spark 1.x.

As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class here for backward compatibility.

SparkSession: The entry point to programming Spark with the Dataset and DataFrame API.


I will talk about Spark version 2.x only.

SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API; through SparkContext you can create RDDs, accumulators, and broadcast variables.

For most cases you won't need SparkContext. You can get SparkContext from SparkSession:

val sc = spark.sparkContext