
Single or Multiple Entities Per Collection in DocumentDB


It is generally good to store multiple entity types per collection. Whether to store entity types within a single document or not takes a bit more thought.

As David mentioned - how to model data is a bit subjective.

Storing Multiple Entity Types within a Collection

First... let's talk about storing multiple entities in a collection. DocumentDB collections are not tables. Collections do not enforce schema; in other words, you can store different types of documents with different schemas in the same collection. You can track different types of entities simply by adding a type attribute to your document.

You should think of collections as a unit of partitioning and a boundary for the execution of queries and transactions. Thus a huge perk of storing different entity types within the same collection is that you get transaction support right out of the box via sprocs (stored procedures).
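For illustration, here is a minimal sketch of such a sproc written against the server-side JavaScript API (shown with TypeScript annotations). The name createCompanyWithEmployee and the document shapes are hypothetical, and on today's partitioned containers the transaction is additionally scoped to a single partition key value.

// Minimal sketch: both documents are written in one implicit transaction
// because the sproc runs inside a single collection.
// getContext() is supplied by the DocumentDB / Cosmos DB server-side runtime.
declare function getContext(): any;

function createCompanyWithEmployee(company: any, employee: any) {
  const collection = getContext().getCollection();
  const response = getContext().getResponse();

  const accepted = collection.createDocument(
    collection.getSelfLink(),
    company,
    (err: any) => {
      if (err) throw err; // throwing aborts the sproc and rolls back the write
      collection.createDocument(
        collection.getSelfLink(),
        employee,
        (err2: any) => {
          if (err2) throw err2; // rolls back both writes
          response.setBody("company and employee created atomically");
        }
      );
    }
  );
  if (!accepted) throw new Error("createDocument request was not accepted");
}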

Storing Multiple Entity Types within a Document

Whether you store multiple entity types within a single document takes a bit more thought. This is commonly referred to as de-normalizing (capturing relationships between data by embedding data in a single document) and normalizing (capturing relationships between data by creating weak links to other documents) your data.

Typically de-normalizing provides better read performance.

The application may need to issue fewer queries and updates to complete common operations.

In general, use de-normalized data models when:

  • entities have “contains” relationships
  • entities have one-to-few relationships
  • the de-normalized data changes infrequently
  • the de-normalized data won’t grow without bound
  • the de-normalized data is integral to the data in the document

Example of a de-normalized data model:

{  "Id": 1001,  "Type": "Company",  "Industry": "Software",  "Employees": [    {      "Id": 10011,      "Type": "Employee",      "Name": "John Doe"    },    {      "Id": 10012,      "Type": "Employee",      "Name": "Jane Doe"    }  ]}

Typically normalizing provides better write performance.

It also provides more flexibility than de-normalizing.

Client-side applications must issue follow-up queries to resolve the references. In other words, normalized data models can require more round trips to the server.

In general, use normalized data models when:

  • de-normalizing would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication
  • you are representing one-to-many relationships
  • you are representing many-to-many relationships
  • related data changes frequently

Example of a normalized data model:

{  "Id": 1001,  "Type": "Company",  "Industry": "Software"}{  "Id": 10011,  "Type": "Employee",  "Name": "John Doe",  "CompanyId": 1001}{  "Id": 10012,  "Type": "Employee",  "Name": "Jane Doe",  "CompanyId": 1001}

Hybrid Approaches

Choosing between normalizing and de-normalizing doesn't have to be a black and white choice. I've often found that a winning design pattern is a hybrid approach, in which you may choose to normalize a partial set of an object's fields and de-normalize the others.

In other words, you could choose to de-normalize frequently read, stable (or immutable) properties to reduce the need for follow-up queries, while normalizing frequently written / mutating fields to reduce the need for fanning out writes.

Example of a hybrid approach:

// Author documents:[{  "id": 1,  "firstName": "Thomas",  "lastName": "Andersen",  "countOfBooks": 3,  "books": [1, 2, 3],  "images": [{    "thumbnail": "http://....png"  }, {    "profile": "http://....png"  }, {    "large": "http://....png"  }]}, {  "id": 2,  "firstName": "William",  "lastName": "Wakefield",  "countOfBooks": 1,  "books": [1, 4, 5],  "images": [{    "thumbnail": "http://....png"  }]}]// Book documents:[{  "id": 1,  "name": "DocumentDB 101",  "authors": [{    "id": 1,    "name": "Thomas Andersen",    "thumbnailUrl": "http://....png"  }, {    "id": 2,    "name": "William Wakefield",    "thumbnailUrl": "http://....png"  }]}, {  "id": 2,  "name": "DocumentDB for RDBMS Users",  "authors": [{    "id": 1,    "name": "Thomas Andersen",    "thumbnailUrl": "http://....png"  }, ]}]


Your question is a bit subjective, as you're asking about entity design, and for that there's no single right answer.

However, from a more objective standpoint: there's nothing stopping you from having multiple entity types within a collection (e.g. a Company document type and an Employee document type, in your case).

You'd need to include some type of hint for yourself (maybe a type property) to help differentiate between the two when running your queries. But by having both types within the same collection, you now have a single collection scope to work within. Regarding a type property: since DocumentDB indexes all properties by default, a type property would be easy to integrate into your queries.
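For example, with the current @azure/cosmos JavaScript SDK, filtering on such a type property is a single parameterized query. The database and container ids, the property name type, and the connection-string environment variable below are illustrative assumptions.

import { CosmosClient } from "@azure/cosmos";

// Assumes COSMOS_CONNECTION_STRING is set and that a single container holds
// both Company and Employee documents, each carrying a "type" discriminator.
const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
const container = client.database("mydb").container("entities");

async function listEmployees() {
  // Every property is indexed by default, so the discriminator filter is cheap.
  const { resources } = await container.items
    .query({
      query: "SELECT * FROM c WHERE c.type = @type",
      parameters: [{ name: "@type", value: "Employee" }],
    })
    .fetchAll();
  return resources;
}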

EDIT: Removed the part about 3-collections-per-capacity-unit, as that arrangement was removed when DocumentDB shifted from Preview to Production.


Many things have changed in Cosmos DB in the last 5 years, and one of the most significant changes affecting data structure design is the ability to create many containers and share the RUs among all of them.

It is still fine to combine multiple entity types in the same container (new name for collection). However, in the year 2020, it is also fine to put each entity type in a separate container.

Of course, it depends on the needs of your application, and a very significant consideration is how you intend to read this information. However, here is a general data structure and approach that you can consider:

  1. Keep each entity type in its own container
  2. Include a property with the name of the entity
  3. Choose partition keys so that no single partition will ever hold more than 10 GB of data
  4. Identify the reads for which you will require optimal performance because of a high volume of calls
  5. Here's the kicker: replicate data to a container optimized for the expected reads

For those reads you want to optimize, replicate the data to a new container dedicated to that purpose and make sure the partition key matches the main parameter that you query by. You can put many different entities in the same container.
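As a sketch of how the partition key choice fits the read pattern, here is how such a read-optimized container could be created with the @azure/cosmos JavaScript SDK. The database id "shop" and the connection-string variable are assumptions; the container id orders-by-user and the /userId key anticipate the example further below.

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);

async function createReadOptimizedContainer() {
  const { database } = await client.databases.createIfNotExists({ id: "shop" });

  // The partition key matches the parameter the expensive reads filter by,
  // so those reads stay within a single logical partition.
  const { container } = await database.containers.createIfNotExists({
    id: "orders-by-user",
    partitionKey: { paths: ["/userId"] },
  });
  return container;
}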

You will find that the read operation is many times more efficient this way.

Cosmos DB performance is largely determined by the volume of data. As long as your documents are easy to reach via their partition key, you do not gain any significant performance by packing a lot of data into a single document as opposed to keeping it in separate documents.

Example

You have two containers:

  • orders
  • order-details

The orders are partitioned by productId and the order details by orderId. A new feature that shows the order history of a single user then costs far too many RUs: not only is getting the orders by the userId property a cross-partition query, but you must also make a follow-up call for every order to fetch its order details, which sit in yet other partitions.

Luckily, both document types contain a userId property. Create a new container, perhaps called orders-by-user, and configure the userId property as its partition key. Then replicate all documents from orders and order-details into this container.
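A one-off backfill of that container could look like the sketch below, using @azure/cosmos. The database id and the assumption that ids do not collide across the two source containers are mine; for continuous sync you would use the mechanisms mentioned at the end of this answer.

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
const db = client.database("shop");
const target = db.container("orders-by-user");

// Copy every document from a source container into orders-by-user.
// Each document already carries userId, the target's partition key.
async function backfill(sourceId: string) {
  const source = db.container(sourceId);
  const { resources } = await source.items.readAll().fetchAll();
  for (const doc of resources) {
    await target.items.upsert(doc); // assumes ids are unique across the sources
  }
}

async function main() {
  await backfill("orders");
  await backfill("order-details");
}

main().catch(console.error);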

You can now make incredibly efficient reads on this container by userId.
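For instance, the order-history read then becomes a single parameterized, single-partition query (sketch using @azure/cosmos; ids and names follow the assumptions above):

import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!);
const ordersByUser = client.database("shop").container("orders-by-user");

async function getOrderHistory(userId: string) {
  const { resources } = await ordersByUser.items
    .query(
      {
        query: "SELECT * FROM c WHERE c.userId = @userId",
        parameters: [{ name: "@userId", value: userId }],
      },
      { partitionKey: userId } // scopes the query to one logical partition
    )
    .fetchAll();
  // Returns both order and order-detail documents for this user; the
  // entity-name property from step 2 tells them apart on the client.
  return resources;
}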

You can replicate using Data Factory or Azure Functions based on the change feed, and soon there will be a built-in feature for this purpose (see the comments): https://stackoverflow.com/a/64355508/392362