
CosmosDB - DocumentDB - Bulk insert without saturating collection RU


Performing bulk inserts of millions of documents is possible under certain circumstances. We just went through an exercise at my company of moving 100M records from various tables in an Azure SQL DB to CosmosDb.

  • It's very important to understand CosmosDb partitions. Choosing a partition key that spreads your data evenly across partitions is critical to getting the throughput you're looking for. Each physical partition has a maximum throughput of 10,000 RU/s, so if you try to shove all of your data into a single partition it doesn't matter how many RU/s you provision; anything above 10k is wasted (assuming nothing else is going on in your container).
  • Also, each logical partition has a max size of 20GB. Once you hit 20GB in size, you'll get errors if you attempt to add more records. Yet another reason to choose your partition key wisely.
  • Use Bulk Insert. Here's a great video that offers a walkthrough. With the latest NuGet package, it's surprisingly easy to use. I found this video to be a much better explanation than what's on docs.microsoft.com. (A rough sketch of the idea follows this list.)
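
The video covers the .NET SDK's bulk mode. As a rough analogue in Python, and only as a sketch: this assumes the azure-cosmos package (4.3+), a hypothetical account, database, and container, and a /customerId partition key; the idea is simply to fan out concurrent upserts so writes land on many partitions at once.

```python
import asyncio
from azure.cosmos.aio import CosmosClient
from azure.cosmos import PartitionKey

ENDPOINT = "https://<your-account>.documents.azure.com:443/"  # assumption
KEY = "<your-key>"                                            # assumption

async def bulk_upsert(docs):
    # Each doc is assumed to be a dict with an "id" and a "customerId" field.
    async with CosmosClient(ENDPOINT, credential=KEY) as client:
        db = await client.create_database_if_not_exists("mydb")
        # A high-cardinality partition key spreads the load across physical
        # partitions instead of hammering a single one.
        container = await db.create_container_if_not_exists(
            id="orders",
            partition_key=PartitionKey(path="/customerId"),
        )
        # Cap concurrency so throttling (429s) doesn't pile up.
        sem = asyncio.Semaphore(50)

        async def write(doc):
            async with sem:
                await container.upsert_item(doc)

        await asyncio.gather(*(write(d) for d in docs))

# asyncio.run(bulk_upsert(my_documents))
```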

Edit: CosmosDb now has Autoscale. With Autoscale enabled, your collection remains at a lower provisioned RU/s and automatically scales up to a max threshold when under load. This will save you a ton of money with your specified use case. We've been using this feature since it went GA.
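A minimal sketch of turning Autoscale on at container creation, assuming a recent azure-cosmos Python release that accepts ThroughputProperties for offer_throughput; the account, names, and the 10,000 RU/s ceiling are placeholders:

```python
from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

client = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<your-key>")
db = client.create_database_if_not_exists("mydb")

container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    # Autoscale: the container idles at 10% of the max (1,000 RU/s here)
    # and scales up to 10,000 RU/s under load.
    offer_throughput=ThroughputProperties(auto_scale_max_throughput=10000),
)
```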

If the majority of your ops are reads, look into Integrated Cache. As of right now, it's in public preview. I haven't played with this, but it can save you money if your traffic is read-heavy.


The key to faster insertion is to distribute your load across multiple physical partitions. In your case, based on the total volume of data in the collection, you will have a minimum of totalVolume / 10GB physical partitions, and your total RUs are distributed equally among these partitions.

Based on your data model, if you could partition your data, you could potentially gain speed by writing to different partitions in parallel.
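As a quick worked example with assumed numbers: a 200 GB collection implies at least 20 physical partitions, so 40,000 provisioned RU/s leaves each partition only about 2,000 RU/s. A sketch of parallelizing across partitions with the sync azure-cosmos client (the container and the /customerId key are assumptions) might group documents by partition key value and drive each group from its own worker:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def insert_partitioned(container, docs):
    # Group documents by partition key value so each worker targets a
    # different logical partition.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["customerId"]].append(doc)

    def write_group(group):
        for doc in group:
            container.upsert_item(doc)

    # Each worker consumes RUs from a different partition's share.
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(write_group, groups.values()))
```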

Since you mentioned that you occasionally have to write a batch of a few million rows, I would advise increasing the RU capacity for that period and then scaling it back down to the level required by your read load.
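A sketch of that pattern with the Python SDK's throughput APIs, assuming standard (manual) throughput on the container and placeholder values:

```python
def with_boosted_throughput(container, boosted_rus, load_fn):
    # Remember the current manual throughput so we can restore it.
    original = container.get_throughput().offer_throughput
    container.replace_throughput(boosted_rus)     # scale up before the batch
    try:
        load_fn()                                 # run the bulk insert
    finally:
        container.replace_throughput(original)    # scale back down afterwards
```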

Writing via stored procedures, while saving on network calls, might not yield much benefit, because a stored procedure can only execute against a single partition. So it can only use the RUs allocated to that partition.
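This is visible in the API itself: a stored procedure call is always scoped to one partition key value. In the Python SDK that looks roughly like the following (the "bulkImport" sproc name and the key value are hypothetical):

```python
result = container.scripts.execute_stored_procedure(
    sproc="bulkImport",
    partition_key="customer-42",     # the single partition the sproc runs in
    params=[docs_for_customer_42],   # only documents for that partition
)
```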

https://docs.microsoft.com/en-us/azure/cosmos-db/partition-data#designing-for-partitioning has some good guidance around what kind of partition makes sense.


If you can't improve the cost of your inserts, you might go the other way and slow down the process so that your overall performance is not impacted. If you look at the official performance benchmarking sample (which inserts documents), you could take it as an idea of how to limit the RU/s you require for inserts. It exposes a lot of parameters that can be tweaked to improve performance, but those can obviously also be used to tailor your RU/s consumption to a certain level.
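One way to sketch that pacing idea in Python: read the request charge the service returns for each write (the x-ms-request-charge header) and sleep once a per-second RU budget is spent. Reaching into the client's last-response headers is a bit of an internal detail, and the budget and container are assumptions:

```python
import time

def paced_inserts(container, docs, ru_budget_per_sec=1000):
    spent, window_start = 0.0, time.monotonic()
    for doc in docs:
        container.upsert_item(doc)
        # The service reports the RU cost of the last operation in this header.
        charge = float(
            container.client_connection.last_response_headers["x-ms-request-charge"]
        )
        spent += charge
        if spent >= ru_budget_per_sec:
            # Budget for this one-second window is used up: wait it out.
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            spent, window_start = 0.0, time.monotonic()
```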

The answer by KranthiKiran pretty much sums up all other things I can think of.