Pull large amounts of data from a remote server, into a DataFrame
I suspect there are a couple of (related) things at play here causing slowness:
- read_sql is written in Python, so it's a little slow (especially compared to read_csv, which is written in Cython and carefully implemented for speed!), and it relies on SQLAlchemy rather than some (potentially much faster) C DBAPI. The impetus to move to SQLAlchemy was to make that move easier in the future (as well as cross-SQL-platform support).
- You may be running out of memory because too many Python objects are in memory (this is related to not using a C DBAPI), but potentially this could be addressed...
I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas read_sql and read_sql_table).
EDIT: As of pandas v0.16.2 this chunk-based approach is natively implemented in read_sql.
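For example, passing the chunksize argument makes read_sql return an iterator of DataFrames rather than one big frame (a minimal sketch; the engine and the users table are placeholders):

import pandas as pd

# read_sql yields DataFrames of up to `chunksize` rows each
chunks = pd.read_sql('SELECT * FROM users', engine, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)  # or process each chunk on its own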
Since you're using postgres you have access to the LIMIT and OFFSET clauses, which makes chunking quite easy. (Am I right in thinking these aren't available in all SQL dialects?)
First, get the number of rows (or an estimate) in your table:
nrows = con.execute('SELECT count(*) FROM users').fetchone()[0] # also works with an sqlalchemy engine
Use this to iterate through the table (for debugging you could add some print statements to confirm it's working and hasn't crashed!) and then combine the results:
def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance the offset, otherwise this loops forever
    return pd.concat(dfs, ignore_index=True)
Note: this assumes that the database fits in memory! If it doesn't, you'll need to work on each chunk (mapreduce style)... or invest in more memory!
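Usage would then look something like this (a sketch, reusing the nrows from above; the users table is a placeholder):

df = read_sql_chunked('SELECT * FROM users', con, nrows, chunksize=10000)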
Try using pandas:
import mysql.connector
import pandas as pd

mysql_cn = mysql.connector.connect(host='localhost', port=123, user='xyz',
                                   passwd='****', db='xy_db')
data = pd.read_sql('SELECT * FROM table;', con=mysql_cn)
mysql_cn.close()
It worked for me.
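If the table is too large to pull in a single read_sql call, the chunksize approach from the answer above works here too (a sketch; what you do with each chunk is up to you):

for chunk in pd.read_sql('SELECT * FROM table;', con=mysql_cn, chunksize=10000):
    # each chunk is a DataFrame of up to 10000 rows; aggregate it,
    # write it to disk, etc., so memory use stays bounded
    print(len(chunk))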
Here is a basic cursor example that might be of help:
import psycopg2
# note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    # print the connection string we will use to connect
    print("Connecting to database\n -> %s" % conn_string)

    conn = psycopg2.connect(conn_string)

    # HERE IS THE IMPORTANT PART, by specifying a name for the cursor
    # psycopg2 creates a server-side cursor, which prevents all of the
    # records from being downloaded at once from the server.
    cursor = conn.cursor('cursor_unique_name', cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute('SELECT * FROM my_table LIMIT 1000')

    # Because cursor objects are iterable we can just call 'for - in' on
    # the cursor object and the cursor will automatically advance itself
    # each iteration.
    # This loop should run 1000 times, assuming there are at least 1000
    # records in 'my_table'
    row_count = 0
    for row in cursor:
        row_count += 1
        print("row: %s %s\n" % (row_count, row))

if __name__ == "__main__":
    main()
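To get from the server-side cursor to a DataFrame without pulling all records at once, you can fetch in batches and concatenate (a sketch, not part of the answer above; it assumes the same connection string and table):

import pandas as pd
import psycopg2

conn = psycopg2.connect("host='localhost' dbname='my_database' user='postgres' password='secret'")
cursor = conn.cursor('cursor_unique_name')  # a named cursor is server-side
cursor.execute('SELECT * FROM my_table')

dfs = []
while True:
    rows = cursor.fetchmany(size=10000)  # pull 10000 records per round trip
    if not rows:
        break
    # cursor.description is populated once rows have been fetched
    dfs.append(pd.DataFrame(rows, columns=[col[0] for col in cursor.description]))

df = pd.concat(dfs, ignore_index=True)
conn.close()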