Dynamic columns in database tables vs EAV Dynamic columns in database tables vs EAV sqlite sqlite

Dynamic columns in database tables vs EAV


I will assume that SQLite (or another relational DBMS) is a requirement.

EAVs

I have worked with EAVs and generic data models, and I can say that the data model is very messy and hard to work with in the long run.

Lets say that you design a datamodel with three tables: entities, attributes, and _entities_attributes_:

CREATE TABLE entities(entity_id INTEGER PRIMARY KEY, name TEXT);CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, name TEXT, type TEXT);CREATE TABLE entity_attributes (entity_id INTEGER, attribute_id INTEGER, value TEXT, PRIMARY KEY(entity_id, attribute_id));

In this model, the entities table will hold your cars, the attributes table will hold the attributes that you can associate to your cars (brand, model, color, ...) and its type (text, number, date, ...), and the _entity_attributes_ will hold the values of the attributes for a given entity (for example "red").

Take into account that with this model you can store as many entities as you want and they can be cars, houses, computers, dogs or whatever (ok, maybe you need a new field on entities, but it's enough for the example).

INSERTs are pretty straightforward. You only need to insert a new object, a bunch of attributes and its relations. For example, to insert a new entity with 3 attributes you will need to execute 7 inserts (one for the entity, three more for the attributes, and three more for the relations.

When you want to perform an UPDATE, you will need to know what is the entity that you want to update, and update the desired attribute joining with the relation between the entity and its attributes.

When you want to perform a DELETE, you will also need to need to know what is the entity you want to delete, delete its attributes, delete the relation between your entity and its attributes and then delete the entity.

But when you want to perform a SELECT the thing becomes nasty (you need to write really difficult queries) and the performance drops horribly.

Imagine a data model to store car entities and its properties as in your example (say that we want to store brand and model). A SELECT to query all your records will be

SELECT brand, model FROM cars;

If you design a generic data model as in the example, the SELECT to query all your stored cars will be really difficult to write and will involve a 3 table join. The query will perform really bad.

Also, think about the definition of your attributes. All your attributes are stored as TEXT, and this can be a problem. What if somebody makes a mistake and stores "red" as a price?

Indexes are another thing that you could not benefit of (or at least not as much as it would be desirable), and they are very neccesary as the data stored grows.

As you say, the main concern as a developer is that the queries are really hard to write, hard to test and hard to maintain (how much would a client have to pay to buy all red, 1980, Pontiac Firebirds that you have?), and will perform very poorly when the data volume increases.

The only advantage of using EAVs is that you can store virtually everything with the same model, but is like having a box full of stuff where you want to find one concrete, small item.

Also, to use an argument from authority, I will say that Tom Kyte argues strongly against generic data models:http://tkyte.blogspot.com.es/2009/01/this-should-be-fun-to-watch.htmlhttps://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056

Dynamic columns in database tables

On the other hand, you can, as you say, generate the tables dynamically, adding (and removing) columns when needed. In this case, you can, for example create a car table with the basic attributes that you know that you will use and then add columns dynamically when you need them (for example the number of exhausts).

The disadvantage is that you will need to add columns to an existing table and (maybe) build new indexes.

This model, as you say, also has another problem when working with SQLite as there's no direct way to delete columns and you will need to do this as stated on http://www.sqlite.org/faq.html#q11

BEGIN TRANSACTION;CREATE TEMPORARY TABLE t1_backup(a,b);INSERT INTO t1_backup SELECT a,b FROM t1;DROP TABLE t1;CREATE TABLE t1(a,b);INSERT INTO t1 SELECT a,b FROM t1_backup;DROP TABLE t1_backup;COMMIT;

Anyway, I don't really think that you will need to delete columns (or at least it will be a very rare scenario). Maybe someone adds the number of doors as a column, and stores a car with this property. You will need to ensure that any of your cars have this property to prevent from losing data before deleting the column. But this, of course depends on your concrete scenario.

Another drawback of this solution is that you will need a table for each entity you want to store (one table to store cars, another to store houses, and so on...).

Another option (pseudo-generic model)

A third option could be to have a pseudo-generic model, with a table having columns to store id, name, and type of the entity, and a given (enough) number of generic columns to store the attributes of your entities.

Lets say that you create a table like this:

CREATE TABLE entities(entity_id INTEGER PRIMARY KEY, name TEXT, type TEXT, attribute1 TEXT, attribute1 TEXT, ... attributeN TEXT );

In this table you can store any entity (cars, houses, dogs) because you have a type field and you can store as many attributes for each entity as you want (N in this case).

If you need to know what the attribute37 stands for when type is "red", you would need to add another table that relates the types and attributes with the description of the attributes.

And what if you find that one of your entities needs more attributes? Then simply add new columns to the entities table (attributeN+1, ...).

In this case, the attributes are always stored as TEXT (as in EAVs) with it's disadvantages.

But you can use indexes, the queries are really simple, the model is generic enough for your case, and in general, I think that the benefits of this model are greater than the drawbacks.

Hope it helps.


Follow up from the comments:

With the pseudo-generic model your entities table will have a lot of columns. From the documentation (https://www.sqlite.org/limits.html), the default setting for SQLITE_MAX_COLUMN is 2000. I have worked with SQLite tables with over 100 columns with great performance, so 40 columns shouldn't be a big deal for SQLite.

As you say, most of your columns will be empty for most of your records, and you will need to index all of your colums for performance, but you can use partial indexes (https://www.sqlite.org/partialindex.html). This way, your indexes will be small, even with a high number of rows, and the selectivity of each index will be great.

If you implement a EAV with only two tables, the number of joins between tables will be less than in my example, but the queries will still be hard to write and maintain, and you will need to do several (outer) joins to extract data, which will reduce performance, even with a great index, when you store a lot of data. For example, imagine that you want to get the brand, model and color of your cars. Your SELECT would look like this:

SELECT e.name, a1.value brand, a2.value model, a3.value colorFROM entities eLEFT JOIN entity_attributes a1 ON (e.entity_id = a1.entity_id and a1.attribute_id = 'brand')LEFT JOIN entity_attributes a2 ON (e.entity_id = a2.entity_id and a2.attribute_id = 'model')LEFT JOIN entity_attributes a3 ON (e.entity_id = a3.entity_id and a3.attribute_id = 'color');

As you see, you would need one (left) outer join for each attribute you want to query (or filter). With the pseudo-generic model the query will be like this:

SELECT name, attribute1 brand, attribute7 model, attribute35 colorFROM entities;

Also, take into account the potential size of your _entity_attributes_ table. If you can potentially have 40 attributes for each entity, lets say that you have 20 not null for each of them. If you have 10,000 entities, your _entity_attributes_ table will have 200,000 rows, and you will be querying it using one huge index. With the pseudo-generic model you will have 10,000 rows and one small index for each column.


It all depends on the way in which your application needs to reason about the data.

If you need to run queries which need to do complicated comparisons or joins on data whose schema you don't know in advance, SQL and the relational model are rarely a good fit.

For instance, if your users can set up arbitrary data entities (like "car" in your example), and then want to find cars whose engine capacity is greater than 2000cc, with at least 3 doors, made after 2010, whose current owner is part of the "little old ladies" table, I'm not aware of an elegant way of doing this in SQL.

However, you could achieve something like this using XML, XPath etc.

If your application has a set on data entities with known attributes, but users can extend those attributes (a common requirement for products like bug trackers), "add column" is a good solution. However, you may need to invent a custom query language to allow users to query those columns. For instance, Atlassian Jira's bug tracking solution has JQL, a SQL-like language for querying bugs.

EAV is great if your task is to store and then show data. However, even moderately complex queries become very hard in an EAV schema - imagine how you'd execute my made up example above.


For your use case, a document oriented database like MongoDB would do great.