How can I write a parquet file using Spark (pyspark)?
The error was due to the fact that the textFile method from SparkContext returned an RDD, and what I needed was a DataFrame.
SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a Parquet file.
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Protob Conversion to Parquet") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    # Read the CSV file into a DataFrame
    df = spark.read.csv("/temp/proto_temp.csv")

    # Display the content of the DataFrame on stdout
    df.show()

    # Write the DataFrame out as Parquet
    df.write.parquet("output/proto.parquet")
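As a minimal sketch, the same read-then-write step can be wrapped in a small helper. The function name csv_to_parquet and the options chosen here are assumptions for illustration, not part of the original fix: header=True assumes the CSV has a header row, inferSchema=True asks Spark to guess column types, and mode("overwrite") replaces existing output instead of failing when the path already exists.

```python
def csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file into a DataFrame and write it back out as Parquet.

    `spark` is expected to be an existing SparkSession. The helper name and
    the option choices below are illustrative assumptions.
    """
    # header=True takes column names from the first CSV row instead of the
    # default _c0, _c1, ...; inferSchema=True makes Spark guess column types
    # (with an extra pass over the data) instead of reading every column
    # as a string.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # mode("overwrite") replaces any existing output; the default mode
    # raises an error if the target path already exists.
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

With an existing session named spark, this would be called as csv_to_parquet(spark, "/temp/proto_temp.csv", "output/proto.parquet"). If the CSV has no header row, drop header=True, otherwise the first data row would be consumed as column names.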
You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood. In Spark 3.2 and later, the same API ships with PySpark itself as pyspark.pandas.
Here's the Koalas code:
    import databricks.koalas as ks

    df = ks.read_csv('/temp/proto_temp.csv')
    df.to_parquet('output/proto.parquet')