
Best way to Insert Python NumPy array into PostgreSQL database


Not sure if this is what you are after, but assuming you have read/write access to an existing postgres DB:

import numpy as np
import psycopg2 as psy
import pickle

db_connect_kwargs = {
    'dbname': '<YOUR_DBNAME>',
    'user': '<YOUR_USRNAME>',
    'password': '<YOUR_PWD>',
    'host': '<HOST>',
    'port': '<PORT>'
}

connection = psy.connect(**db_connect_kwargs)
connection.set_session(autocommit=True)

cursor = connection.cursor()
cursor.execute(
    """
    DROP TABLE IF EXISTS numpy_arrays;
    CREATE TABLE numpy_arrays (
        uuid VARCHAR PRIMARY KEY,
        np_array_bytes BYTEA
    )
    """
)

The gist of this approach is to store any numpy array (of arbitrary shape and data type) as a row in the numpy_arrays table, where uuid is a unique identifier used to retrieve the array later. The array itself is stored in the np_array_bytes column as bytes.

Inserting into the database:

some_array = np.random.rand(1500, 550)
some_array_uuid = 'some_array'

cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    (some_array_uuid, pickle.dumps(some_array))
)

Querying from the database:

uuid = 'some_array'

cursor.execute(
    """
    SELECT np_array_bytes
    FROM numpy_arrays
    WHERE uuid=%s
    """,
    (uuid,)
)
some_array = pickle.loads(cursor.fetchone()[0])

Performance?

If we could store our NumPy arrays directly in PostgreSQL we would get a major performance boost.

I haven't benchmarked this approach in any way, so I can't confirm or refute this...
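If performance matters for your use case, a rough way to check is to time a round trip through the table against np.save/np.load on local disk. A minimal sketch, reusing the connection, cursor and table from above (the 'bench_array' uuid and the .npy filename are just placeholders):

import time

arr = np.random.rand(1500, 550)

# Round trip through PostgreSQL: pickle, INSERT, SELECT, unpickle.
t0 = time.perf_counter()
cursor.execute(
    "INSERT INTO numpy_arrays (uuid, np_array_bytes) VALUES (%s, %s)",
    ('bench_array', pickle.dumps(arr))
)
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('bench_array',)
)
_ = pickle.loads(cursor.fetchone()[0])
db_seconds = time.perf_counter() - t0

# Round trip through the filesystem: np.save, np.load.
t0 = time.perf_counter()
np.save('bench_array.npy', arr)
_ = np.load('bench_array.npy')
file_seconds = time.perf_counter() - t0

print(f"DB round trip:   {db_seconds:.4f} s")
print(f"File round trip: {file_seconds:.4f} s")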

Disk Space?

My guess is that this approach takes about as much disk space as dumping the arrays to a file using np.save('some_array.npy', some_array). If this is an issue, consider compressing the bytes before insertion, as sketched below.
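For example, with the standard-library zlib (how much this actually saves depends entirely on the data; the 'compressed_array' uuid is just a placeholder):

import zlib

some_array = np.random.rand(1500, 550)

# Compress the pickled bytes before inserting them ...
compressed = zlib.compress(pickle.dumps(some_array))
cursor.execute(
    """
    INSERT INTO numpy_arrays(uuid, np_array_bytes)
    VALUES (%s, %s)
    """,
    ('compressed_array', compressed)
)

# ... and decompress after fetching.
cursor.execute(
    "SELECT np_array_bytes FROM numpy_arrays WHERE uuid = %s",
    ('compressed_array',)
)
restored = pickle.loads(zlib.decompress(cursor.fetchone()[0]))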


You can use subprocess.run() to execute a shell command that bulk-copies the CSV files to the server with PostgreSQL's COPY (via psql's \copy). I'm more familiar with MSSQL, which has the bcp utility, so I haven't been able to test this fully, but I imagine it's a similar approach of calling the client through the terminal. The command below is based on the third link, although that solution uses subprocess.call(), which has since been superseded by subprocess.run().

https://docs.python.org/3/library/subprocess.html#subprocess.run
https://ieftimov.com/post/postgresql-copy/

Python psql \copy CSV to remote server

import os
import subprocess

# user, hostname, password, dbname and file_location all defined elsewhere above.
# No outer quotes are needed around the \copy command: passing a list to
# subprocess.run() does not go through a shell.
psql_command = "\\copy table (col1, col2) FROM file_location CSV HEADER QUOTE '\"' NULL ''"

# psql's -w flag only suppresses the password prompt; the password itself
# is supplied through the PGPASSWORD environment variable.
command = [
    "psql",
    "-U", user,
    "-h", hostname,
    "-d", dbname,
    "-w",
    "-c", psql_command,
]
subprocess.run(command, env=dict(os.environ, PGPASSWORD=password))
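If you'd rather not shell out to psql at all, psycopg2 can stream the same CSV through COPY ... FROM STDIN with copy_expert. A minimal sketch, assuming an open psycopg2 connection built from the same credentials, with my_table, col1, col2 and file_location as placeholders:

import psycopg2

connection = psycopg2.connect(
    user=user, password=password, host=hostname, dbname=dbname
)

# Same options as the \copy command above, expressed as a server-side COPY
# that reads the file contents streamed from the client.
copy_sql = """
    COPY my_table (col1, col2)
    FROM STDIN WITH (FORMAT csv, HEADER true, QUOTE '"', NULL '')
"""

with open(file_location) as f, connection.cursor() as cur:
    cur.copy_expert(copy_sql, f)
connection.commit()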