How to append data to an existing parquet file


Spark's DataFrame writer has a SaveMode called Append: https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SaveMode.html which I believe solves your problem.

Example of use:

df.write.mode('append').parquet('parquet_data_file')
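
To put that in context, here is a minimal runnable sketch (assuming a local PySpark installation; the path parquet_data_file is hypothetical). Note that append mode adds new part files next to the existing ones in the dataset directory; it does not modify any file that is already there:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-example").getOrCreate()

# The first write creates the dataset.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df1.write.mode("overwrite").parquet("parquet_data_file")

# A write with mode 'append' adds new part files alongside the
# existing ones instead of replacing them.
df2 = spark.createDataFrame([(3, "c")], ["id", "value"])
df2.write.mode("append").parquet("parquet_data_file")

spark.read.parquet("parquet_data_file").show()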


Parquet is a columnar file format. It is optimized for writing all of a file's columns together, so any edit requires rewriting the entire file.
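
To make that concrete, here is a minimal sketch using pyarrow (the file name data.parquet and the id/value columns are hypothetical; the new rows must match the existing file's schema). Because a Parquet file keeps its metadata in a footer, "appending" to a single file amounts to reading it back, concatenating, and rewriting the whole file:

import pyarrow as pa
import pyarrow.parquet as pq

# Read the existing file entirely into memory.
existing = pq.read_table("data.parquet")

# Hypothetical new rows; the schema must match the existing file.
new_rows = pa.table({"id": [5], "value": ["e"]})

# Concatenate and rewrite the whole file from scratch.
combined = pa.concat_tables([existing, new_rows])
pq.write_table(combined, "data.parquet")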

From Wikipedia:

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
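
As a toy illustration of that layout (plain Python, not the actual Parquet encoding), transposing the example table groups each column's values together, exactly as in the serialized string above:

# The example table from the excerpt above, one tuple per row.
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# A column-oriented layout serializes each column in turn.
for column in zip(*rows):
    print(column)
# (10, 12, 11, 22)
# ('Smith', 'Jones', 'Johnson', 'Jones')
# ('Joe', 'Mary', 'Cathy', 'Bob')
# (40000, 50000, 44000, 55000)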

Some links

https://en.wikipedia.org/wiki/Column-oriented_DBMS

https://parquet.apache.org/