
Avro vs. Parquet


If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(job, MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

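For context, a minimal driver sketch along these lines might look as follows. This is an assumption-laden illustration, not a complete job: the ParquetWriteDriver class name and the /data/in and /data/out paths are made up, and the mapper/reducer wiring is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetWriteDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "write-parquet");
        job.setJarByClass(ParquetWriteDriver.class);

        // Hypothetical input/output paths; mapper/reducer setup omitted for brevity
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        // Same Avro schema as before; only the output format changes
        // (MyAvroType is the Avro-generated class from the snippets above)
        job.setOutputFormatClass(AvroParquetOutputFormat.class);
        AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
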
The Parquet format does seem to be a bit more computationally intensive on the write side, e.g., requiring RAM for buffering and CPU for ordering the data, but it should reduce I/O, storage, and transfer costs, and it makes for efficient reads, especially with SQL-like queries (e.g., Hive or Spark SQL) that only address a portion of the columns.
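
As a hedged illustration of that read-side benefit, a Spark job that only references a couple of columns lets Parquet skip the rest of the file; the /data/events.parquet path and the user_id/event_time column names below are assumptions made up for the example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetColumnPruning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-column-pruning")
                .getOrCreate();

        // Hypothetical Parquet dataset; only the referenced columns are scanned
        Dataset<Row> events = spark.read().parquet("/data/events.parquet");
        events.select("user_id", "event_time")
              .where("event_time >= '2020-01-01'")
              .show();

        spark.stop();
    }
}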

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, if used well, it allows for significant performance improvements.


Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of a lot of columns but you are only interested in a subset of them, then you can use Parquet.

HBase is useful when frequent updates to the data are involved. Avro is fast for retrieval; Parquet is much faster for column-oriented reads.


Avro

  • Widely used as a serialization platform
  • Row-based, offers a compact and fast binary format
  • Schema is encoded in the file, so the data can be untagged (see the sketch after this list)
  • Files support block compression and are splittable
  • Supports schema evolution
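
As a rough sketch of what "schema encoded in the file" means in practice, here is a minimal round trip through Avro's generic API; the User record, its fields, and the users.avro file name are made up for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a two-field "User" record
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Write a container file; the schema is embedded in the file header
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the reader picks the schema up from the file itself
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}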

Parquet

  • Column-oriented binary file format
  • Uses the record shredding and assembly algorithm described in the Dremel paper
  • Each data file contains the values for a set of rows
  • Efficient in terms of disk I/O when only specific columns need to be queried (see the projection sketch below)

From Choosing an HDFS data storage format - Avro vs. Parquet and more
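
As a hedged illustration of that column-level I/O efficiency (not taken from the post above), a Parquet reader can be asked to materialize only a projection of the columns; the users.parquet file and the single-field User projection schema below are assumptions. Because Parquet stores each column's values contiguously within a row group, restricting the projection like this avoids reading the other columns' pages at all.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetProjectionRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical projection schema covering only the column we need
        Schema projection = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        // Only the projected column chunks are read from disk
        Path file = new Path("users.parquet");
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(file)
                                  .withConf(conf)
                                  .build()) {
            GenericRecord rec;
            while ((rec = reader.read()) != null) {
                System.out.println(rec.get("name"));
            }
        }
    }
}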