Can I append Avro serialized data to an existing Azure blob?


The short answer here is that I was trying to do the wrong thing.

First, we decided that Avro is not the appropriate format for on-the-wire serialization, primarily because Avro expects the schema definition to be present in every Avro file. This adds a lot of weight to what is transmitted. You could still use Avro, but that's not what it's designed for. (It is designed for big files on HDFS.)

Secondly, the existing .NET libraries only support appending to Avro files via a stream, which does not map well to Azure block blobs (you don't want to open a block blob as a stream).

Thirdly, even if these first two issues could be bypassed, all of the items in a single Avro file are expected to share the same schema. We had a set of heterogeneous items flowing in that we wanted to buffer, batch, and write to blob. Trying to segregate the items by type/schema as we were writing them to blob added a lot of complication. In the end, we opted to use JSON.
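Since the answer settles on JSON, here is a minimal sketch of the buffer-and-batch approach described above, using newline-delimited JSON so that heterogeneous items can share one blob. The `JsonBatchBuffer` type is hypothetical (not part of the original code), and `System.Text.Json` is assumed as the serializer:

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical sketch: buffer heterogeneous items as newline-delimited JSON
// (one object per line), so a batch can simply be appended to a blob later.
class JsonBatchBuffer
{
    private readonly StringBuilder _batch = new StringBuilder();

    public void Add(object item)
    {
        // Each serialized item carries its own structure, so mixed types
        // are fine, unlike a single-schema Avro container file.
        _batch.AppendLine(JsonSerializer.Serialize(item));
    }

    // Bytes ready to be appended to a blob (e.g. via CloudAppendBlob).
    public byte[] ToBytes() => Encoding.UTF8.GetBytes(_batch.ToString());
}
```

Readers can then split the blob contents on newlines and dispatch on a type field inside each object.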


It is possible to do this.

First of all, you have to use CloudAppendBlob:

    CloudAppendBlob appBlob = container.GetAppendBlobReference(
        string.Format("{0}{1}", date.ToString("yyyyMMdd"), ".log"));
    appBlob.AppendText(
        string.Format(
            "{0} | Error: Something went wrong and we had to write to the log!!!\r\n",
            dateLogEntry.ToString("o")));

The second step is to tell the Avro library not to write the header on append and to share the same sync marker between appends:

    var avroSerializer = AvroSerializer.Create<Object>();
    using (var buffer = new MemoryStream())
    {
        using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
        {
            Console.WriteLine("Init Sample Data Set...");
            var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            var header = headerField.GetValue(w);
            var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            marker.SetValue(header, new byte[16]);
            using (var writer = new SequentialWriter<Object>(w, 24))
            {
                // Serialize the data to the stream by using the sequential writer
                for (int i = 0; i < 10; i++)
                {
                    writer.Write(new Object());
                }
            }
        }

        Console.WriteLine("Append Sample Data Set...");

        // Prepare the stream for appending the data
        using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
        {
            var isHeaderWritten = w.GetType().GetField("isHeaderWritten", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            isHeaderWritten.SetValue(w, true);
            var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            var header = headerField.GetValue(w);
            var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
            marker.SetValue(header, new byte[16]);
            using (var writer = new SequentialWriter<Object>(w, 24))
            {
                // Serialize the data to the stream by using the sequential writer
                for (int i = 10; i < 20; i++)
                {
                    writer.Write(new Object());
                }
            }
        }

        Console.WriteLine("Deserializing Sample Data Set...");
    }
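The snippet ends with a "Deserializing" message but doesn't show the read side. Assuming the same Microsoft.Hadoop.Avro library, reading the combined stream back might look like the sketch below (`buffer` is the `MemoryStream` from above; treat this as unverified against every library version):

```csharp
// Sketch: read back all objects from the combined buffer. Because both
// writes shared the same sync marker and the second one skipped the
// header, the stream reads as one continuous Avro container file.
buffer.Seek(0, SeekOrigin.Begin);
using (var reader = AvroContainer.CreateReader<Object>(buffer))
using (var seq = new SequentialReader<Object>(reader))
{
    foreach (var obj in seq.Objects)
    {
        Console.WriteLine(obj);
    }
}
```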