Create DataFrame from list of tuples using pyspark
Next time, please provide a working example; that makes it much easier to help.
The way your RDD is structured is an unusual starting point for a DataFrame. This is how the Spark documentation creates one:
    >>> l = [('Alice', 1)]
    >>> sqlContext.createDataFrame(l).collect()
    [Row(_1=u'Alice', _2=1)]
    >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
    [Row(name=u'Alice', age=1)]
For your example, you can produce the desired output like this:
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Your data at the moment
    data = sc.parallelize([
        [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
        [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
        [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
    ])

    # Convert each record to a plain tuple, keeping only the values
    data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))

    # Define schema (the PackSize values are floats, so use DoubleType)
    schema = StructType([
        StructField("Id", StringType(), True),
        StructField("Packsize", DoubleType(), True),
        StructField("Name", StringType(), True)
    ])

    # Create dataframe
    DF = sqlContext.createDataFrame(data_converted, schema)

    # Output
    DF.show()
    +----------------+--------+----+
    |              Id|Packsize|Name|
    +----------------+--------+----+
    |a0w1a0000003xB1A|     1.0|   A|
    |a0w1a0000003xAAI|     1.0|   B|
    |a0w1a00000xB3AAI|    30.0|   C|
    +----------------+--------+----+
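One caveat about the conversion step: the positional lambda assumes every record lists its fields in the same order (Id, PackSize, Name). If that is not guaranteed, looking the values up by key is safer. Below is a minimal sketch in plain Python, no Spark needed to try it; `pairs_to_tuple` and `FIELDS` are hypothetical names I made up for illustration, and the second record is shuffled on purpose.

```python
# Field order for the output tuples (taken from the example records above).
FIELDS = ("Id", "PackSize", "Name")

def pairs_to_tuple(pairs, fields=FIELDS):
    """Turn a list of (key, value) pairs into a plain tuple ordered by `fields`."""
    record = dict(pairs)  # look values up by key instead of by position
    return tuple(record[name] for name in fields)

rows = [
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Name', 'B'), ('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0)],  # shuffled order
]

converted = [pairs_to_tuple(r) for r in rows]
print(converted)
```

With the RDD from your example, the same function should be usable as `data.map(pairs_to_tuple)` in place of the lambda.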
Hope this helps.