Spark RDD to DataFrame python


There are two ways to convert an RDD to a DataFrame in Spark:

toDF() and createDataFrame(rdd, schema)

I will show you how you can do that dynamically.

toDF()

The toDF() method converts an RDD[Row] to a DataFrame. The key point is that a Row() object can be built from a **kwargs argument, so there is an easy way to do this dynamically.

    from pyspark.sql.types import Row

    # here you are going to create a function that turns each
    # record into a dict, keyed by the string of its index
    def f(x):
        d = {}
        for i in range(len(x)):
            d[str(i)] = x[i]
        return d

    # now populate that: build a Row from each dict and convert
    df = rdd.map(lambda x: Row(**f(x))).toDF()

This way you can create a DataFrame dynamically.
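To see what the dict-building helper does without needing a Spark cluster, here is a pure-Python sketch; the sample tuple is a made-up record, and the Row(**kwargs) step is left out so it runs standalone:

```python
# Pure-Python illustration of the dict-building step; no Spark needed.
def f(x):
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d

kwargs = f(("Alice", 30))  # hypothetical record
print(kwargs)  # {'0': 'Alice', '1': 30}
```

Each column ends up named by its position ("0", "1", ...), which is what makes the approach work for RDDs of any width.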

createDataFrame(rdd, schema)

Another way to do it is to create a dynamic schema. How?

This way:

    from pyspark.sql.types import StructType
    from pyspark.sql.types import StructField
    from pyspark.sql.types import StringType

    # one nullable StringType column per index, named "0".."31"
    schema = StructType([StructField(str(i), StringType(), True) for i in range(32)])
    df = sqlContext.createDataFrame(rdd, schema)

This second way is cleaner.
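The 32 in the example above is a hard-coded column count. As a pure-Python sketch of how the field names could instead be derived from the data itself, assuming you grab a sample record (something like rdd.first() in Spark; here a hypothetical tuple stands in for it):

```python
# Sketch: infer the column count from a sample record rather than
# hard-coding it. `sample` stands in for rdd.first(); the
# StructField/StringType wrappers are omitted so this runs without Spark.
sample = ("Alice", "30", "NYC")  # hypothetical first row
field_names = [str(i) for i in range(len(sample))]
print(field_names)  # ['0', '1', '2']
```

The resulting names feed straight into the StructField list comprehension shown above.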

So this is how you can create dataframes dynamically.


I liked Arun's answer better, but there is a tiny problem and I could not comment on or edit the answer. sparkContext does not have createDataFrame; sqlContext does (as Thiago mentioned). So:

    from pyspark.sql import SQLContext

    # assuming the Spark environment is set and sc is spark.sparkContext
    sqlContext = SQLContext(sc)
    schemaPeople = sqlContext.createDataFrame(RDDName)
    schemaPeople.createOrReplaceTempView("RDDName")


Try this and see if it works:

    sc = spark.sparkContext

    # Infer the schema, and register the DataFrame as a table.
    schemaPeople = spark.createDataFrame(RddName)
    schemaPeople.createOrReplaceTempView("RddName")