Sqlite column with repetitive values

python database sqlite duplicates database-performance

The question is very similar to Add data to many-to-many relation with one SQL command, but it also discusses further aspect - automatic clean-up of unused entities.

Is there a way to let SQLite do everything automatically? ... (without having to maintain anything ourselves)

No. You basically want to insert rows in the base table and referenced table if it doesn't exist yet, while specifying the reference by value rather than its surrogate key. That in fact isn't a straightforward task in other RDBMSes as well. Some of them support:

Stored procedures
Multitable inserts
OUTPUT clause
RETURNING clause
Writable (updatable) views
INSTEAD OF triggers on views

From the above list SQLite only supports INSTEAD OF triggers. Here's how it applies to your use case (I have adopted table words from your db<>fiddle mentioned in comments under the question and renamed its column a to value):

PRAGMA foreign_keys = ON;CREATE TABLE words(  id INTEGER PRIMARY KEY,  value TEXT);CREATE UNIQUE INDEX unique_words_value ON words(value);CREATE TABLE data(  id INTEGER PRIMARY KEY,  word_id INTEGER NOT NULL,  FOREIGN KEY (word_id) REFERENCES words(id));CREATE VIEW data_view ASSELECT d.id, w.value FROM data AS d INNER JOIN words AS w on w.id = d.word_id;CREATE TRIGGER data_view_insert INSTEAD OF INSERT ON data_viewBEGIN  INSERT OR IGNORE INTO words(value) VALUES (NEW.value);  INSERT OR IGNORE INTO data(word_id) VALUES(    (SELECT id FROM words WHERE value = NEW.value)  );END;INSERT INTO data_view (value) VALUES  ('random1'),  ('random2'),  ('random3'),  ('random1'),  ('random3'),  ('random4');

The INSERT statement produced this content of table words:

id	value
1	random1
2	random2
3	random3
4	random4

and data_view:

id	value
1	random1
2	random2
3	random3
4	random1
5	random3
6	random4

To support deleting row from data with automatic clean-up of unused values in words you can add INSTEAD OF DELETE trigger on data_view:

CREATE TRIGGER data_view_delete INSTEAD OF DELETE ON data_viewBEGIN  DELETE FROM data    WHERE id = OLD.id;  DELETE FROM words    WHERE value = OLD.value AND NOT EXISTS(SELECT 1 FROM data_view WHERE value = OLD.value);END;

Its first statement deletes a row in table data and the second deletes the referenced value from words, if it isn't referenced by other rows in table data. Similarly to INSERT, you delete rows from table data indirectly via data_view:

-- delete one 'random2' and 'random4' value dataDELETE FROM data_view WHERE id IN (2, 4);

That results in words:

id	value
1	random1
3	random3
4	random4

As you can see the 'random2' value was deleted, because it was referenced only once in data, and 'random4' value was kept, because there was another reference in table data (data_view):

id	value
1	random1
3	random3
5	random3
6	random4

Here's db<>fiddle to play with.

python database sqlite duplicates database-performance

Full credit to @PeterWolf for his excellent answer, here is a slightly modified version and ready-to-run code:

import sqlite3, randomVALUES = ["hello world", "it's a shame to store this str many times", "bye bye", "abc"]db = sqlite3.connect('repetitive3.db')db.executescript("""CREATE TABLE words(id INTEGER PRIMARY KEY, value TEXT UNIQUE);CREATE TABLE data(id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL);CREATE VIEW data_view AS SELECT d.id, w.value FROM data AS d INNER JOIN words AS w on w.id = d.word_id;CREATE TRIGGER data_view_insert INSTEAD OF INSERT ON data_viewBEGIN  INSERT OR IGNORE INTO words(value) VALUES(NEW.value);  INSERT OR IGNORE INTO data(word_id) VALUES((SELECT id FROM words WHERE value = NEW.value));END;""")for i in range(1000*1000):    db.execute("INSERT INTO data_view (value) VALUES (?)", (random.choice(VALUES),))print(list(db.execute("SELECT * FROM words")))print(list(db.execute("SELECT * FROM data WHERE id BETWEEN 100 AND 105")))print(list(db.execute("SELECT * FROM data_view WHERE id BETWEEN 100 AND 105")))

Minor modifications (the result is the same: 9 MB only database size)

there is here no index on table words, but only the UNIQUE keyword for value, that ensures that INSERT OR IGNORE INTO falls into IGNORE if the value already exists
no FOREIGN KEY used

Same idea, but without a view:

db.execute('CREATE TABLE words(value TEXT UNIQUE);')db.execute('CREATE TABLE data(id INTEGER PRIMARY KEY, word_id INTEGER NOT NULL);')for i in range(1000*1000):    v = random.choice(VALUES)    db.execute("INSERT OR IGNORE INTO words(value) VALUES(?);", (v,))    db.execute("INSERT INTO data(word_id) VALUES ((SELECT rowid FROM words WHERE value = ?));", (v,))searched_word = 'hello world'print(list(db.execute("SELECT id, word_id FROM data WHERE id BETWEEN 100 AND 120 AND word_id = (SELECT rowid FROM words WHERE value = ?)", (searched_word,))))

CodeHunter

Sqlite column with repetitive values

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last