How to write pyspark dataframe to HDFS and then how to read it back into dataframe?


  • Writing a DataFrame to HDFS (Spark 1.6):

    # df is an existing DataFrame object
    df.write.save('/target/path/', format='parquet', mode='append')

Some of the supported format options are csv, parquet, json, etc.; a couple of write variations are sketched below.
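For illustration, here is a minimal sketch of the write side, assuming `df` is an existing DataFrame; the paths are placeholders. The shorthand writers (`write.parquet`, `write.json`) are convenience wrappers around `save` with the matching format.

    # Shorthand writers, equivalent to save(..., format=...);
    # paths are placeholders, `df` is assumed to already exist.
    df.write.parquet('/target/path_parquet/', mode='append')
    df.write.json('/target/path_json/', mode='overwrite')

    # `mode` controls what happens when the target path already exists:
    #   'append'    - add new files alongside the existing data
    #   'overwrite' - replace the existing data
    #   'ignore'    - silently skip the write
    #   'error'     - raise an exception (the default)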

  • Reading a DataFrame from HDFS (Spark 1.6):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc is an existing SparkContext
    df = sqlContext.read.format('parquet').load('/path/to/file')

The format method takes arguments such as parquet, csv, json, etc.; a complete round trip is sketched below.
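If you are on Spark 2.x or later, SQLContext is superseded by SparkSession. Here is a self-contained round trip under that assumption; the app name and paths are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('hdfs-roundtrip').getOrCreate()

    # Write: build a small DataFrame and save it as parquet on HDFS.
    df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])
    df.write.save('/tmp/example_parquet', format='parquet', mode='overwrite')

    # Read: load it back into a fresh DataFrame.
    df2 = spark.read.format('parquet').load('/tmp/example_parquet')
    df2.show()

Note that spark.read.parquet(path) is an equivalent shorthand for spark.read.format('parquet').load(path).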