
Pull large amounts of data from a remote server, into a DataFrame


I suspect there are a couple of (related) things at play here causing slowness:

  1. read_sql is written in Python, so it's a little slow (especially compared to read_csv, which is written in Cython and carefully implemented for speed!), and it relies on SQLAlchemy rather than some (potentially much faster) C DBAPI. The impetus for moving to SQLAlchemy was to make such a switch easier in the future (as well as to get cross-SQL-platform support).
  2. You may be running out of memory because too many Python objects are held in memory at once (this is related to not using a C DBAPI), but this could potentially be addressed...

I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas read_sql and read_sql_table).

EDIT: As of pandas v0.16.2 this chunk-based approach is natively implemented in read_sql (via the chunksize argument).
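For example, with a recent pandas you can pass chunksize directly and get back an iterator of DataFrames. This is a minimal sketch; the connection URL and the users table are placeholders matching the examples below:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:secret@localhost:5432/my_database')

# chunksize makes read_sql return an iterator of DataFrames instead of one big frame
chunks = pd.read_sql('SELECT * FROM users', engine, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)  # only concatenate if the full result fits in memory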


Since you're using Postgres you have access to LIMIT and OFFSET, which makes chunking quite easy. (Am I right in thinking these aren't available in all SQL dialects?)

First, get the number of rows (or an estimate) in your table:

nrows = con.execute('SELECT count(*) FROM users').fetchone()[0]  # also works with an sqlalchemy engine

Use this to iterate through the table (for debugging you could add some print statements to confirm that it's working and hasn't crashed!) and then combine the results:

def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance the offset so the loop terminates
    return pd.concat(dfs, ignore_index=True)
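For example, calling it with the connection and row count from above (the users table is just the one from the count query):

df = read_sql_chunked("SELECT * FROM users", con, nrows, chunksize=5000)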

Note: this assumes that the full result fits in memory! If it doesn't you'll need to work on each chunk (map-reduce style), as sketched below... or invest in more memory!
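A rough sketch of that chunk-at-a-time (map-reduce style) alternative, reusing the engine from the earlier sketch and assuming a hypothetical numeric column amount that you want to sum without holding the whole table:

running_total = 0
for chunk in pd.read_sql('SELECT * FROM users', engine, chunksize=10000):
    running_total += chunk['amount'].sum()  # reduce each chunk to a small result, then let it go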


Try using pandas:

import mysql.connector
import pandas as pd

mysql_cn = mysql.connector.connect(host='localhost', port=123, user='xyz',
                                   passwd='****', db='xy_db')
data = pd.read_sql('SELECT * FROM table;', con=mysql_cn)
mysql_cn.close()

It worked for me.
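If the table is too big for a single SELECT *, the same read_sql call also accepts the chunksize argument mentioned above. A sketch, where process is just a placeholder for whatever per-chunk work you need, run before mysql_cn.close():

for chunk in pd.read_sql('SELECT * FROM table;', con=mysql_cn, chunksize=50000):
    process(chunk)  # process() is a placeholder, not a real function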


Here is a basic cursor example that might be of help:

import psycopg2
# note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    ### print the connection string we will use to connect

    conn = psycopg2.connect(conn_string)

    ### HERE IS THE IMPORTANT PART, by specifying a name for the cursor
    ### psycopg2 creates a server-side cursor, which prevents all of the
    ### records from being downloaded at once from the server.
    cursor = conn.cursor('cursor_unique_name', cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute('SELECT * FROM my_table LIMIT 1000')

    ### Because cursor objects are iterable we can just call 'for - in' on
    ### the cursor object and the cursor will automatically advance itself
    ### each iteration.
    ### This loop should run 1000 times, assuming there are at least 1000
    ### records in 'my_table'
    row_count = 0
    for row in cursor:
        row_count += 1
        print("row: %s    %s\n" % (row_count, row))

if __name__ == "__main__":
    main()
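To land the rows in a DataFrame rather than printing them, the same kind of server-side (named) cursor can be drained in batches with fetchmany. This is a sketch reusing the connection string and my_table from the example above:

import pandas as pd
import psycopg2

conn = psycopg2.connect("host='localhost' dbname='my_database' user='postgres' password='secret'")
cursor = conn.cursor('cursor_unique_name')  # naming the cursor keeps it server-side
cursor.execute('SELECT * FROM my_table')

dfs = []
columns = None
while True:
    rows = cursor.fetchmany(10000)  # pull 10,000 rows per round trip
    if not rows:
        break
    if columns is None:
        # for named cursors, description is populated once rows have been fetched
        columns = [col[0] for col in cursor.description]
    dfs.append(pd.DataFrame(rows, columns=columns))

df = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
cursor.close()
conn.close()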