Best way to Insert Python NumPy array into PostgreSQL database
Not sure if this is what you are after, but assuming you have read/write access to an existing Postgres DB:
import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)

cursor = connection.cursor()
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)
The gist of this approach is to store any numpy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier to be able to later retrieve the array. The actual array would be saved in the np_array_bytes column as bytes.
Inserting into the database:
some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)
Querying from the database:
uuid = 'some_array'
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
some_array = pickle.loads(cursor.fetchone()[0])
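As a quick sanity check, you can round-trip an array and compare it to the original; the 'roundtrip_check' key below is just an illustrative uuid:

# Round-trip sanity check: insert a fresh array, read it back, and compare.
original = np.random.rand(10, 10)
cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('roundtrip_check', pickle.dumps(original))
)
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    ('roundtrip_check',)
)
restored = pickle.loads(cursor.fetchone()[0])
assert np.array_equal(original, restored)  # values match exactly
assert original.dtype == restored.dtype    # dtype survives the round trip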
Performance?
"If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost."

I haven't benchmarked this approach in any way, so I can neither confirm nor refute this...
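If you want a rough number for your own setup, here is a minimal timing sketch; it uses only the standard-library time module and assumes the table, cursor, and imports from the snippets above ('benchmark_array' is just an example key):

import time

arr = np.random.rand(1500, 550)

# Time a single insert of the pickled array.
start = time.perf_counter()
cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('benchmark_array', pickle.dumps(arr))
)
insert_seconds = time.perf_counter() - start

# Time fetching and unpickling it again.
start = time.perf_counter()
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    ('benchmark_array',)
)
_ = pickle.loads(cursor.fetchone()[0])
fetch_seconds = time.perf_counter() - start

print(f"insert: {insert_seconds:.3f}s, fetch: {fetch_seconds:.3f}s")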
Disk Space?
My guess is that this approach takes about as much disk space as dumping the array to a file using np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion.
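For example, here is a minimal sketch using the standard-library zlib module; the ratio you get depends heavily on the array's contents (random floats barely compress, structured data can shrink a lot):

import zlib

# Compress the pickled bytes before inserting.
compressed = zlib.compress(pickle.dumps(some_array))
cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('some_array_compressed', compressed)
)

# Decompress after querying.
cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    ('some_array_compressed',)
)
some_array = pickle.loads(zlib.decompress(cursor.fetchone()[0]))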
You can use subprocess.run() to execute shell commands that bulk-copy CSV files to the server via PostgreSQL's COPY. I'm more familiar with MSSQL, which has the bcp utility, so I'm unable to fully test this solution, but I imagine it's a similar method of calling through the terminal. The terminal command is based on the third link below, whose solution uses subprocess.call(); that API has since been superseded by subprocess.run().
https://docs.python.org/3/library/subprocess.html#subprocess.run
https://ieftimov.com/post/postgresql-copy/
Python psql \copy CSV to remote server
import os
import subprocess

# With subprocess.run() and an argument list, no shell is involved, so the
# \copy command needs no extra layer of quoting around it.
psql_command = "\\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\"' NULL ''"

# user, hostname, password, dbname all defined elsewhere above.
# psql has no flag for passing a password directly; supply it through the
# PGPASSWORD environment variable (-w only suppresses the password prompt).
command = [
    "psql",
    "-U", user,
    "-h", hostname,
    "-d", dbname,
    "-w",
    "-c", psql_command,
]
subprocess.run(command, env=dict(os.environ, PGPASSWORD=password))
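One thing worth adding: subprocess.run() returns a CompletedProcess, and passing check=True makes failures visible by raising CalledProcessError on a non-zero exit status (capture_output needs Python 3.7+):

# Fail loudly if psql exits with a non-zero status, and capture its
# output for inspection.
result = subprocess.run(
    command,
    env=dict(os.environ, PGPASSWORD=password),
    check=True,
    capture_output=True,
    text=True,
)
print(result.stdout)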