
pyodbc - very slow bulk insert speed


I was having a similar issue with pyodbc inserting into a SQL Server 2008 database using executemany(). When I ran a profiler trace on the SQL side, pyodbc was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement and close the connection, and repeat this entire cycle for every row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.
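
For reference, a rough sketch of what the ceODBC version can look like. ceODBC follows the Python DB API, so the calls mirror pyodbc; the connection string, table, and column names below are placeholders rather than my actual setup.

import ceODBC

# Placeholder connection string; use whatever driver/DSN you normally connect with.
conn = ceODBC.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<SERVER>;DATABASE=<DB>;UID=<UNAME>;PWD=<PWD>"
)
cur = conn.cursor()

rows = [(1, 'alpha'), (2, 'beta'), (3, 'gamma')]  # example data
# The statement is prepared once and executed for every row, instead of the
# prepare/execute/unprepare cycle per row that showed up in the pyodbc trace.
cur.executemany("INSERT INTO <THE TABLE> (field1, field2) VALUES (?, ?)", rows)
conn.commit()
conn.close()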


Trying to insert 2M+ rows into MSSQL using pyodbc was taking an absurdly long time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.

Many answers correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was sub-second, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your insert and commit after each chunk. I found 5k rows per chunk delivered good performance, but this will obviously depend on many factors (the data, the machine, DB config, etc.).

import pyodbc

CHUNK_SIZE = 5000

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):  # use xrange in Python 2, range in Python 3
        yield l[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>,
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False)  # IMPORTANT: commit manually, once per chunk

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True  # IMPORTANT: enables the fast bulk insert path

# df is a pandas DataFrame (defined elsewhere) holding the rows to insert
params = [tuple(x) for x in df.values]

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)"""
for chunk in chunks(params, CHUNK_SIZE):  # IMPORTANT: insert in chunks, commit each one
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()
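
Note that, as far as I know, fast_executemany was only added in pyodbc 4.0.19, so this approach needs a reasonably recent pyodbc, and the ODBC driver has to support parameter arrays (the Microsoft ODBC Driver for SQL Server does).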


I tried both ceODBC and mxODBC and both were also painfully slow. I ended up going with an ADODB connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!

import win32com.client

comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)  # 1 = adOpenKeyset, 3 = adLockOptimistic

# fldLST is the list of field names; each f in values is a matching row of values
for f in values:
    rs.AddNew(fldLST, f)
rs.Update()