Getting large rows out of SQL Azure - but where to go? Tables, Blob or something like MongoDB?

mongodb azure azure-sql-database azure-blob-storage azure-table-storage

I signed up just solely to help with this question. In the past, I have found useful answers to my problems from Stackoverflow - thank you community - so I thought it would just be fair (perhaps fair is an understatement) to attempt to give something back with this question, as it falls on my alley.

In short, while considering all factors stated in the question, table storage may be the best option - iif you can properly estimate transactions per month: a nice article on this. You can solve the two limitations that you mentioned, row and column limit, by splitting (plain text method or serializing it) the document/html/data. Speaking from experience with 40 GB+ data stored in Table Storage, where frequently our app retrieves more than 10 rows per each page visit in milliseconds - no argument here! If you need 50+ rows at times, you are looking at low single digits second(s), or you can do them in parallel (and further by splitting the data in different partitions), or in some async fashion. Or, read suggested multi level caching below.

A bit more detail. I tried with SQL Azure, Blob (both page and block), and Table Storage. I can not speak for Mongo DB since, partially for the reasons already mentioned here, I did not want to go that route.

Table Storage is fast; in the range of 20-50 milliseconds, or even faster sometimes (depends, for instance in the same data center i have seen it gone as low as 10 milliseconds), when querying with partition and row key. You may also further have several partitions, in some fashion based on your data and your knowledge about it.
It scales better, in terms of GB's but not transactions
Row and column limitations that you mentioned are a burden, agreed, but not a show stopper. I have written my own solution to split entities, you can too easily, or you can see this already-written-solution (does not solve the whole problem but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities
Also need to keep in mind that uploading data to table storage is time consuming, even when batching entities due to other limitations (i.e., request size less than 4 MB, upload bandwidth, etc).

But using solely just TableStorage may not be the best solution (thinking about growth and economics). The best solution that we ended up implementing used multi-level caching/storage, starting from static classes, Azure Role Based Cache, Table Storage, and Block Blobs. Lets call this, for readability purposes, level 1A, 1B, 2 and 3 respectively. Using this approach, we are using a medium single instance (2 CPU Cores and 3.5 GB Ram - my laptop has better performance), and are able to process/query/rank 100GB+ of data in seconds (95% of cases in under 1 second). I believe this is fairly impressive given that we check all "articles" before displaying them (4+ million "articles"). First, this is tricky and may or may not be possible in your case. I do not have sufficient knowledge about the data and its query/processing usage, but if you can find a way to organize the data well this may be ideal. I will make an assumption: it sounds like you are trying to search through and find relevant articles given some information about a user and some tags (a variant of a news aggregator perhaps, just got a hunch for that). This assumption is made for the sake of illustrating the suggestion, so even if not correct, I hope it will help you or trigger new ideas on how this could be adopted.

Level 1A data.Identify and add key entities or its properties in a static class (periodically, depending on how you foresee updates). Say we identify user preferences (e.g., demographics and interest, etc) and tags (tech, politics, sports, etc). This will be used to retrieve quickly who the user is, his/her preferences, and any tags. Think of these as key/value pair; for instance key being a tag, and its value being a list of article IDs, or a range of it. This solves a small piece of a problem, and that is: given a set of keys (user pref, tags, etc) what articles are we interested in! This data should be small in size, if organized properly (e.g., instead of storing article path, you can only store a number). *Note: the problem with data persistence in a static class is that application pool in Azure, by default, resets every 20 minutes or so of inactivity, thus your data in the static class is not persistent any longer - also sharing them across instances (if you have more than 1) can become a burden. Welcome level 1B to the rescue.

Leval 1B dataA solution we used, is to keep layer 1A data in a Azure Cache, for its sole purpose to re-populate the static entity when and if needed. Level 1B data solves this problem. Also, if you face issues with application pool reset timing, you can change that programmatically. So level 1A and 1B have the same data, but one is faster than the other (close enough analogy: CPU Cache and RAM).

Discussing level 1A and 1B a bitOne may point out that it is an overkill to use a static class and cache, since it uses more memory. But, the problem we found in practice, is that, first it is faster with static. Second, in cache there are some limitations (ie., 8 MB per object). With big data, that is a small limit. By keeping data in a static class one can have larger than 8 MB objects, and store them in cache by splitting them (i.e., currently we have over 40 splits). BTW please vote to increase this limit in the next release of azure, thank you! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size

Level 2 dataOnce we get the values from the key/value entity (level 1A), we use the value to retrieve the data in Table Storage. The value should tell you what partition and Row Key you need. Problem being solved here: you only query those rows relevant to the user/search context. As you can see now, having level 1A data is to minimize row querying from table storage.

Level 3 dataTable storage data can hold a summary of your articles, or the first paragraph, or something of that nature. When it is needed to show the whole article, you will get it from Blob. Table storage, should also have a column that uniquely identifies the full article in blob. In blob you may organize the data in the following manner:

Split each article in separate files.
Group n articles in one file.
Group all articles in one file (not recommended although not as bad as the first impression one may get).

For the 1st option you would store, in table storage, the path of the article, then just grab it directly from Blob. Because of the above levels, you should need to read only a few full articles here.

For the 2nd and 3rd option you would store, in table storage, the path of the file and the start and end position from where to read and where to stop reading, using seek.

Here is a sample code in C#:

YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);        int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;        int numBytesRead = 0;        while (numBytesToRead > 0)        {          int n = YourBlobClientWithReferenceToTheFile.Read(bytes,numBytesRead,numBytesToRead);            if (n == 0)                break;            numBytesRead += n;            numBytesToRead -= n;        }

I hope this didn't turn into a book, and hope it was helpful. Feel free to contact me if you have follow up questions or comments. Thanks!

mongodb azure azure-sql-database azure-blob-storage azure-table-storage

The proper storage for a file is a blob. But if your query needs to return dozens of blobs at the same time, it will be too slow as you are pointing out. So you could use a hybrid approach: use Azure Tables for 98% of your data, and if it's too large, use a Blob instead and store the Blob URI in your table.

Also, are you compressing your content at all? I sure would.

mongodb azure azure-sql-database azure-blob-storage azure-table-storage

You could use MongoDB's GridFS feature: http://docs.mongodb.org/manual/core/gridfs/

It splits the data into 256k chunks by default (configurable up to 16mb) and lets you use the sharded database as a filesystem which you can use to store and retrieve files. If the file is larger than the chunk size, the mongo db drivers handle splitting up / re-assembling the data when the file needs to be retrieved. To add additional disk space, simply add additional shards.

You should be aware, however that only some mongodb drivers support this and it is a driver convention and not a server feature that allows for this behavior.

CodeHunter

Getting large rows out of SQL Azure - but where to go? Tables, Blob or something like MongoDB?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last