What is the advantage of storing schema in avro?


  1. Evolving schemas

Suppose you initially designed a schema like this for your Employee class:

{"type": "record", "name": "Employee", "fields": [{"name": "emp_name", "type": "string"}, {"name": "dob", "type": "string"}, {"name": "age", "type": "int"}]}

Later you realized that age is redundant and removed it from the schema:

{"type": "record", "name": "Employee", "fields": [{"name": "emp_name", "type": "string"}, {"name": "dob", "type": "string"}]}

What about the records that were serialized and stored before this schema change? How will you read those records back?

That's why the Avro reader/deserializer asks for both the reader schema and the writer schema. Internally it performs schema resolution, i.e. it tries to adapt data written with the old schema to the new schema.

See the section "Resolution using action symbols" at http://avro.apache.org/docs/1.7.2/api/java/org/apache/avro/io/parsing/doc-files/parsing.html

In this case it performs a skip action, i.e. it reads past "age" and leaves it out of the result. It can also handle cases such as a field changing from int to long.

This is a very nice article explaining schema evolution - http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

  2. Schema is stored only once for multiple records in a single file.

  3. Size: because field names and types live in the schema, each record is encoded as values only, in very few bytes.
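To make the skip action and the compact encoding a bit more concrete, here is a minimal pure-Python sketch. It is not the Avro library and not how Avro is implemented internally. It only mimics Avro's binary layout for int, long and string fields (zigzag variable-length integers, length-prefixed UTF-8 strings) and resolves a record written with the old Employee schema against the new one. The helper names (write_long, encode, decode and so on) are made up for this illustration.

```python
import io

# Old (writer) and new (reader) Employee schemas, as plain dicts.
WRITER_SCHEMA = {"type": "record", "name": "Employee", "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
    {"name": "age", "type": "int"},
]}
READER_SCHEMA = {"type": "record", "name": "Employee", "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "dob", "type": "string"},
]}


def write_long(out, n):
    """Zigzag + variable-length encoding, the layout Avro uses for int and long."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    shift, result = 0, 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_value(out, typ, value):
    if typ in ("int", "long"):
        write_long(out, value)
    elif typ == "string":
        data = value.encode("utf-8")
        write_long(out, len(data))
        out.write(data)
    else:
        raise ValueError(f"type not supported in this sketch: {typ}")


def read_value(inp, typ):
    if typ in ("int", "long"):
        return read_long(inp)
    if typ == "string":
        return inp.read(read_long(inp)).decode("utf-8")
    raise ValueError(f"type not supported in this sketch: {typ}")


def encode(schema, record):
    """Concatenate the field values in schema order; no field names are written."""
    out = io.BytesIO()
    for field in schema["fields"]:
        write_value(out, field["type"], record[field["name"]])
    return out.getvalue()


def decode(writer_schema, reader_schema, data):
    """Walk the bytes using the writer schema, keep only what the reader schema wants."""
    wanted = {f["name"]: f for f in reader_schema["fields"]}
    inp = io.BytesIO(data)
    record = {}
    for field in writer_schema["fields"]:
        value = read_value(inp, field["type"])  # bytes must be consumed either way
        if field["name"] in wanted:
            record[field["name"]] = value
        # otherwise: skip action, the value is read past and dropped
    for name, field in wanted.items():
        if name not in record:
            # Reader-only field: use its default (KeyError here if there is none).
            record[name] = field["default"]
    return record


old_bytes = encode(WRITER_SCHEMA, {"emp_name": "Ann", "dob": "1990-01-01", "age": 34})
print(len(old_bytes))
print(decode(WRITER_SCHEMA, READER_SCHEMA, old_bytes))
```

Note that decode cannot find where "age" starts or ends without the writer schema, because the bytes carry no field names or type tags. That is exactly why the reader needs both schemas.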


I think one of the key problems solved by schema evolution is not mentioned explicitly anywhere, which is why it causes so much confusion for newcomers.

An example will clarify this:

Let us say a bank stores an audit log of all its transactions. The logs have a particular format and need to be stored for at least 10 years. It is also highly desirable that the system holding these logs adapts as the format evolves over those 10 years.

The schema for such entries would not change too often, let us say twice a year on average, but each schema would have a large number of entries. If we do not keep track of the schemas, then after a while we will need to consult very old code to figure out which fields were present at a given time, and keep adding if-else statements to process the different formats. With a schema store holding all these formats, we can use the schema-evolution feature to convert one format into another (Avro does this automatically if you provide it with the older and newer schemas). This saves applications from adding a lot of if-else statements to their code. It also makes the system more manageable, since we can see every format we have just by looking at the set of stored schemas (schemas are generally kept in separate storage, and the data only carries an ID pointing to its schema).
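A rough sketch of that idea, in plain Python rather than with a real schema registry: SchemaStore, read_entry and the audit-entry fields (account, amount, currency) are all hypothetical names invented for this example. Each log entry holds only a schema ID plus positional values, so the writer schema has to be looked up before the entry can be interpreted, and a single resolution function replaces the per-format if-else branches.

```python
class SchemaStore:
    """Hypothetical in-memory schema store: maps a small integer ID to a schema."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema):
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = schema
        return schema_id

    def get(self, schema_id):
        return self._schemas[schema_id]


def read_entry(store, entry, reader_schema):
    """Interpret an entry with the schema it was written with, then adapt it."""
    schema_id, values = entry
    writer_schema = store.get(schema_id)
    written = {f["name"]: v for f, v in zip(writer_schema["fields"], values)}
    result = {}
    for field in reader_schema["fields"]:
        if field["name"] in written:
            result[field["name"]] = written[field["name"]]
        else:
            # Field added after this entry was written: fall back to its default.
            result[field["name"]] = field["default"]
    return result


store = SchemaStore()
v1 = store.register({"name": "AuditEntry", "fields": [
    {"name": "account", "type": "string"},
    {"name": "amount", "type": "long"},
]})
v2 = store.register({"name": "AuditEntry", "fields": [
    {"name": "account", "type": "string"},
    {"name": "amount", "type": "long"},
    {"name": "currency", "type": "string", "default": "USD"},
]})

# The stored data carries only the schema ID, never the schema itself.
log = [
    (v1, ("A-1", 500)),
    (v2, ("B-7", 120, "EUR")),
]

latest = store.get(v2)
upgraded = [read_entry(store, entry, latest) for entry in log]
print(upgraded)
```

The processing code only ever sees records shaped like the latest schema, no matter how many older formats are sitting in the log.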

Another advantage of schema evolution is that producers of a new format can safely produce objects with the new schema without waiting for the downstream consumers to change first. The downstream consumers can have logic built in to simply suspend processing until they have visibility of the new schema associated with the new format. This automatic suspension keeps the system online while the processing logic is adapted to the new schema.
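One way such a consumer could behave, sketched in plain Python. This is an assumption about the design, not something Avro provides out of the box: the Consumer class and its methods are hypothetical. It processes entries in order and pauses at the first entry whose schema ID it does not know, then resumes once that schema becomes visible.

```python
from collections import deque


class Consumer:
    """Hypothetical downstream consumer that pauses on schema IDs it has not seen."""

    def __init__(self, known_schema_ids):
        self.known = set(known_schema_ids)
        self.pending = deque()
        self.processed = []

    def offer(self, schema_id, payload):
        self.pending.append((schema_id, payload))
        self.drain()

    def learn(self, schema_id):
        """Called once the new schema is visible to this consumer."""
        self.known.add(schema_id)
        self.drain()

    def drain(self):
        # Process in arrival order; stop at the first entry with an unknown schema.
        while self.pending and self.pending[0][0] in self.known:
            self.processed.append(self.pending.popleft()[1])


consumer = Consumer(known_schema_ids={1})
consumer.offer(1, "old-format entry")
consumer.offer(2, "new-format entry")   # unknown schema: processing is suspended
consumer.offer(1, "another old entry")  # queued behind it to preserve order
suspended = list(consumer.pending)
consumer.learn(2)                       # schema becomes visible, backlog drains
print(consumer.processed)
```

Holding back the later old-format entry is a choice made here to preserve ordering; a consumer that does not care about order could keep processing known formats and park only the unknown ones.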

So in summary, schema evolution helps newer clients read older formats through automatic format conversion, and it also helps older clients suspend processing gracefully until they have been enabled to understand newer formats.