Parquet vs ORC vs ORC with Snappy

I would say, that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree like Google Dremel does (See here).
Apache ORC might be better if your file-structure is flattened.

And as far as I know parquet does not support Indexes yet. ORC comes with a light weight Index and since Hive 0.14 an additional Bloom Filter which might be helpful the better query response time especially when it comes to sum operations.

The Parquet default compression is SNAPPY. Are Table A - B - C and D holding the same Dataset? If yes it looks like there is something shady about it, when it only compresses to 1.9 GB

hadoop hive parquet snappy orc

You are seeing this because:

Hive has a vectorized ORC reader but no vectorized parquet reader.
Spark has a vectorized parquet reader and no vectorized ORC reader.
Spark performs best with parquet, hive performs best with ORC.

I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.

(correct as of Hive 2.0 and Spark 2.1)

hadoop hive parquet snappy orc

Both Parquet and ORC have their own advantages and disadvantages. But I simply try to follow a simple rule of thumb - "How nested is your Data and how many columns are there". If you follow the Google Dremel you can find how parquet is designed. They user a hierarchal tree-like structure to store data. More the nesting deeper the tree.

But ORC is designed for a flattened file store. So if your Data is flattened with fewer columns, you can go with ORC, otherwise, parquet would be fine for you. Compression on flattened Data works amazingly in ORC.

We did some benchmarking with a larger flattened file, converted it to spark Dataframe and stored it in both parquet and ORC format in S3 and did querying with **Redshift-Spectrum **.

Size of the file in parquet: ~7.5 GB and took 7 minutes to writeSize of the file in ORC: ~7.1. GB and took 6 minutes to writeQuery seems faster in ORC files.

Soon we will do some benchmarking for nested Data and update the results here.

CodeHunter

Parquet vs ORC vs ORC with Snappy

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last