how to set the primary key when writing a pandas dataframe to a sqlite database table using df.to_sql


Unfortunately there is currently no way to set a primary key in the pandas df.to_sql() method. Additionally, just to make things more of a pain, there is no way to set a primary key on a column in SQLite after a table has been created.

However, a workaround at the moment is to create the table in SQLite with the pandas df.to_sql() method, then create a duplicate table with your primary key defined, copy your data over, and finally drop the old table to clean up.

It would be something along the lines of this.

import pandas as pd
import sqlite3

# filename and name (the table name) are placeholders here
df = pd.read_csv("/Users/data/" + filename)
columns = df.columns
columns = [i.replace(' ', '_') for i in columns]
df.columns = columns

# connect to the database
conn = sqlite3.connect('database')

# write the pandas dataframe to a sqlite table
df.to_sql(name, conn, schema=None, if_exists='replace', index=True,
          index_label=None, chunksize=None, dtype=None)

c = conn.cursor()
c.executescript('''
    PRAGMA foreign_keys=off;

    BEGIN TRANSACTION;
    ALTER TABLE table_name RENAME TO old_table;

    /* create a new table with the same column names and types while
       defining a primary key for the desired column */
    CREATE TABLE new_table (col_1 TEXT PRIMARY KEY NOT NULL,
                            col_2 TEXT);

    INSERT INTO new_table SELECT * FROM old_table;
    DROP TABLE old_table;
    COMMIT TRANSACTION;

    PRAGMA foreign_keys=on;
''')

# close out the connection
c.close()
conn.close()

I have done this in the past when I have faced this issue. I just wrapped the whole thing in a function to make it more convenient...

In my limited experience with SQLite, not being able to add a primary key after a table has been created, not being able to perform upserts (update-or-insert), and the lack of UPDATE ... JOIN have caused a lot of frustration and some unconventional workarounds.

Lastly, the pandas df.to_sql() method has a dtype keyword argument that can take a dictionary mapping column names to types, e.g. dtype={'col_1': 'TEXT'}.
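For example, here is a minimal sketch of the dtype argument with a plain sqlite3 connection (the table and column names are made up for illustration); with a DBAPI connection the types are given as SQLite type strings:

import sqlite3
import pandas as pd

df = pd.DataFrame({"col_1": ["a", "b"], "col_2": [1.0, 2.0]})
conn = sqlite3.connect("example.db")

# dtype maps column names to the SQLite type used in the generated CREATE TABLE
df.to_sql("my_table", conn, if_exists="replace", index=False,
          dtype={"col_1": "TEXT", "col_2": "REAL"})

conn.close()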


Building on Chris Guarino's answer, here are some functions that provide a more general solution. See the example at the bottom for how to use them.

import re

def get_create_table_string(tablename, connection):
    sql = """
    select * from sqlite_master where name = "{}" and type = "table"
    """.format(tablename)
    result = connection.execute(sql)
    create_table_string = result.fetchmany()[0][4]
    return create_table_string

def add_pk_to_create_table_string(create_table_string, colname):
    regex = "(\n.+{}[^,]+)(,)".format(colname)
    return re.sub(regex, "\\1 PRIMARY KEY,", create_table_string, count=1)

def add_pk_to_sqlite_table(tablename, index_column, connection):
    cts = get_create_table_string(tablename, connection)
    cts = add_pk_to_create_table_string(cts, index_column)
    template = """
    BEGIN TRANSACTION;
        ALTER TABLE {tablename} RENAME TO {tablename}_old_;

        {cts};

        INSERT INTO {tablename} SELECT * FROM {tablename}_old_;

        DROP TABLE {tablename}_old_;
    COMMIT TRANSACTION;
    """
    create_and_drop_sql = template.format(tablename=tablename, cts=cts)
    connection.executescript(create_and_drop_sql)

# Example:
# import pandas as pd
# import sqlite3
#
# df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
# con = sqlite3.connect("deleteme.db")
# df.to_sql("df", con, if_exists="replace")
#
# add_pk_to_sqlite_table("df", "index", con)
# r = con.execute("select sql from sqlite_master where name = 'df' and type = 'table'")
# print(r.fetchone()[0])

There is a gist of this code here.


Building on Chris Guarino's answer, it is almost impossible to assign a primary key to an already existing column with the df.to_sql() method. Likewise, with your 500 MB CSV file you cannot create a duplicate table with a huge number of columns.

However, a small workaround is to add a new column as the primary key while writing the dataframe to SQL. It is possible to iterate over the dataframe's columns to build the CREATE TABLE statement, and during that creation you can add a primary key. With this, a duplicate table is not needed.

I am adding a small code snippet of it.

import pandas as pd
import sqlite3
import sqlalchemy
from sqlalchemy import create_engine

df = pd.read_excel(r'C:\XXX\XXX\XXXX\XXX.xlsx')
X1 = df.iloc[0:, 0:]
dataset = X1.astype('float32')

# add a date column and use it as the index
dataset['date'] = pd.date_range(start='1/1/2020', periods=len(dataset), freq='D')
dataset = dataset.set_index('date')

engine = create_engine('sqlite:///measurement.db')
sqlite_connection = engine.connect()
sqlite_table = "table1"

# create the table up front with an autoincrement primary key,
# a column for the index, and one REAL column per dataframe column
sqlite_connection.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT, date TIMESTAMP, " +
                          ",".join(["%s REAL" % x for x in dataset.columns]) + ")")

dataset.to_sql(sqlite_table, sqlite_connection, if_exists='append')

Output database table:

[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'date', 'TIMESTAMP', 0, None, 0),
 (2, 'time_stamp', 'REAL', 0, None, 0),
 (3, 'column_1', 'REAL', 0, None, 0),
 (4, 'column_2', 'REAL', 0, None, 0)]

This method works only if the dataframe has an index. Also, to have the index as a column in the table, it must be explicitly included in the CREATE TABLE statement, as shown in the sketch below.
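Here is a minimal sketch of that point (the table and column names are made up for illustration): the pre-created table needs a column whose name matches the dataframe's index name, and to_sql(..., index=True, if_exists='append') then fills it.

import sqlite3
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
df.index.name = "row_id"  # the index must be named so it maps onto a table column

conn = sqlite3.connect("example.db")

# create the table up front with an autoincrement primary key
# plus a column for the dataframe's index
conn.execute("CREATE TABLE readings ("
             "id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "row_id INTEGER, value REAL)")

# append the dataframe; index=True writes the index into the matching row_id column
df.to_sql("readings", conn, if_exists="append", index=True)

conn.commit()
conn.close()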

Hope this helps for huge database creations.