
Pull large amounts of data from a remote server, into a DataFrame


I suspect there are a couple of (related) things at play here causing slowness:

  1. read_sql is written in Python, so it's a little slow (especially compared to read_csv, which is written in Cython and carefully implemented for speed!), and it relies on SQLAlchemy rather than some (potentially much faster) C DBAPI. The impetus for moving to SQLAlchemy was to make such a switch easier in the future (as well as to get cross-SQL-platform support).
  2. You may be running out of memory because too many Python objects are held in memory at once (this is related to not using a C DBAPI), but this could potentially be addressed...

I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas read_sql and read_sql_table).

EDIT: As of pandas v0.16.2 this chunk-based approach is natively implemented in read_sql (via the chunksize argument).
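For example, with a recent pandas you can pass chunksize directly and get back an iterator of DataFrames. This is a minimal sketch; the connection URL and the users table are placeholders matching the examples below:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:secret@localhost:5432/my_database')

# chunksize makes read_sql return an iterator of DataFrames instead of one big frame
chunks = pd.read_sql('SELECT * FROM users', engine, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)  # only concatenate if the full result fits in memory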


Since you're using Postgres you have access to LIMIT and OFFSET, which makes chunking quite easy. (Am I right in thinking these aren't available in all SQL dialects?)

First, get the number of rows (or an estimate) in your table:

nrows = con.execute('SELECT count(*) FROM users').fetchone()[0]  # also works with an sqlalchemy engine

Use this to iterate through the table (for debugging you could add some print statements to confirm that it's working and hasn't crashed!) and then combine the results:

def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0
    dfs = []  # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance the offset so the loop terminates
    return pd.concat(dfs, ignore_index=True)
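For example, calling it with the connection and row count from above (the users table is just the one from the count query):

df = read_sql_chunked("SELECT * FROM users", con, nrows, chunksize=5000)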

Note: this assumes that the full result fits in memory! If it doesn't you'll need to work on each chunk (map-reduce style), as sketched below... or invest in more memory!
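A rough sketch of that chunk-at-a-time (map-reduce style) alternative, reusing the engine from the earlier sketch and assuming a hypothetical numeric column amount that you want to sum without holding the whole table:

running_total = 0
for chunk in pd.read_sql('SELECT * FROM users', engine, chunksize=10000):
    running_total += chunk['amount'].sum()  # reduce each chunk to a small result, then let it go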


Try using pandas:

import mysql.connector
import pandas as pd

mysql_cn = mysql.connector.connect(host='localhost', port=123, user='xyz',
                                   passwd='****', db='xy_db')
data = pd.read_sql('SELECT * FROM table;', con=mysql_cn)
mysql_cn.close()

It worked for me.
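If the table is too big for a single SELECT *, the same read_sql call also accepts the chunksize argument mentioned above. A sketch, where process is just a placeholder for whatever per-chunk work you need, run before mysql_cn.close():

for chunk in pd.read_sql('SELECT * FROM table;', con=mysql_cn, chunksize=50000):
    process(chunk)  # process() is a placeholder, not a real function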


Here is a basic cursor example that might be of help:

import psycopg2
# note that we have to import the Psycopg2 extras library!
import psycopg2.extras
import sys

def main():
    conn_string = "host='localhost' dbname='my_database' user='postgres' password='secret'"
    ### print the connection string we will use to connect

    conn = psycopg2.connect(conn_string)

    ### HERE IS THE IMPORTANT PART, by specifying a name for the cursor
    ### psycopg2 creates a server-side cursor, which prevents all of the
    ### records from being downloaded at once from the server.
    cursor = conn.cursor('cursor_unique_name', cursor_factory=psycopg2.extras.DictCursor)
    cursor.execute('SELECT * FROM my_table LIMIT 1000')

    ### Because cursor objects are iterable we can just call 'for - in' on
    ### the cursor object and the cursor will automatically advance itself
    ### each iteration.
    ### This loop should run 1000 times, assuming there are at least 1000
    ### records in 'my_table'
    row_count = 0
    for row in cursor:
        row_count += 1
        print("row: %s    %s\n" % (row_count, row))

if __name__ == "__main__":
    main()
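To land the rows in a DataFrame rather than printing them, the same kind of server-side (named) cursor can be drained in batches with fetchmany. This is a sketch reusing the connection string and my_table from the example above:

import pandas as pd
import psycopg2

conn = psycopg2.connect("host='localhost' dbname='my_database' user='postgres' password='secret'")
cursor = conn.cursor('cursor_unique_name')  # naming the cursor keeps it server-side
cursor.execute('SELECT * FROM my_table')

dfs = []
columns = None
while True:
    rows = cursor.fetchmany(10000)  # pull 10,000 rows per round trip
    if not rows:
        break
    if columns is None:
        # for named cursors, description is populated once rows have been fetched
        columns = [col[0] for col in cursor.description]
    dfs.append(pd.DataFrame(rows, columns=columns))

df = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
cursor.close()
conn.close()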