
How to transform a huge CSV into SQLite using Pandas?


I haven't done any work with CSVs of that size, but it sounds like the kind of thing Odo might solve quickly.

I did a cursory check of the docs, and it appears they cover larger-than-memory CSV parsing into SQL databases, specifically calling out SQLite3 as a destination.

Here's the example they publish for parsing a 33 GB text file.

In [1]: dshape = discover(resource('all.csv'))

In [2]: %time t = odo('all.no.header.csv', 'sqlite:///db.db::nyc',
   ...:               dshape=dshape)
CPU times: user 3.09 s, sys: 819 ms, total: 3.91 s
Wall time: 57min 31s
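If you want to adapt that to your own file, a minimal sketch might look like the following (assuming the odo package is installed; the file name, database name, and table name are placeholders, not taken from the original example):

from odo import odo, discover, resource

# Infer column names and types straight from the CSV.
dshape = discover(resource('huge.csv'))

# Stream the CSV into a table called `mytable` inside db.db,
# creating it from the discovered dshape.
odo('huge.csv', 'sqlite:///db.db::mytable', dshape=dshape)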


This is going to be problematic with pandas due to its size. Any reason you can't use the csv module and just iterate through the file?

Basic idea (untested):

import csv
import gzip
import sqlite3

# Open the gzipped CSV in text mode so csv.reader gets strings, not bytes.
with gzip.open('db.gz', 'rt', newline='') as f, sqlite3.connect('db.sqlite') as cnx:
    reader = csv.reader(f)
    c = cnx.cursor()
    # The table must already exist; use one placeholder per column.
    c.executemany('insert into table_name values (?,?,...)', reader)
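One thing that sketch leaves out is creating the table and committing in manageable batches; a slightly fuller sketch (the file, table, and column names below are made up for illustration) could look like this:

import csv
import gzip
import sqlite3
from itertools import islice

# File, table, and column names here are placeholders.
with gzip.open('data.csv.gz', 'rt', newline='') as f, \
        sqlite3.connect('data.sqlite') as cnx:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if there is one

    c = cnx.cursor()
    c.execute('create table if not exists table_name (id text, name text, value text)')

    # Insert in fixed-size batches so each transaction stays small.
    while True:
        batch = list(islice(reader, 100000))
        if not batch:
            break
        c.executemany('insert into table_name values (?, ?, ?)', batch)
        cnx.commit()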


[Update on 06-15-2017]

It seems that csv2sqlite.py may be the way to go with SQLite. Chunk-by-chunk loading is definitely too slow for big files (> 1 GB). When I tested the 6.5 GB nyc311calls.csv with csv2sqlite.py, it took only about 24 minutes to create a SQLite database file, including data type guessing. That is about the same time MySQL needs with "LOAD DATA INFILE", which is not bad, even though you may need to change the data type of some columns afterwards. In my opinion, csv2sqlite.py is currently the most time-efficient way to create a SQLite database file from a CSV file.

1) Download csv2sqlite.py from here and put it in the directory containing your CSV file.

2) Using the Windows command prompt, go to the directory containing csv2sqlite.py and the CSV file (e.g. nyc311calls.csv) you want to import.

3) Run python csv2sqlite.py nyc311calls.csv database_name.db and wait. Note: the Python executable must be on your Windows PATH (Environment Variables).
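Once the script finishes, you can sanity-check the new database with Python's built-in sqlite3 module; this just lists the tables csv2sqlite.py created and counts their rows:

import sqlite3

with sqlite3.connect('database_name.db') as cnx:
    tables = cnx.execute(
        "select name from sqlite_master where type = 'table'").fetchall()
    for (name,) in tables:
        count = cnx.execute('select count(*) from "%s"' % name).fetchone()[0]
        print(name, count, 'rows')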


This is a little old question, but it seems nobody has given a clear answer, so I hope mine helps. For SQLite, I recommend this site, which explains the idea and what you should do: a chunk-by-chunk load. I tested several approaches, and so far this is the most reliable way in my opinion.

The basic procedure is:

1) Import a small portion of the big table into pandas.
2) Process it and append it to SQLite.
3) Repeat until the whole file is loaded (see the sketch below).
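For reference, a minimal sketch of that loop with pandas and sqlite3 might look like the following (the file name, table name, and chunk size are placeholders, and any real cleaning would go inside the loop):

import sqlite3
import pandas as pd

cnx = sqlite3.connect('nyc311calls.db')

# read_csv with chunksize returns an iterator of DataFrames.
for chunk in pd.read_csv('nyc311calls.csv', chunksize=100000, low_memory=False):
    # Any per-chunk processing (renaming, type conversion, ...) goes here.
    chunk.to_sql('nyc311calls', cnx, if_exists='append', index=False)

cnx.close()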

I uploaded a more detailed procedure of what I did here (Jupyter notebook), if you are interested. You can find the NYC 311 call data here.

A few comments from me.

1) The Odo package does not work well if your data contains empty strings. I hope they improve this. In other words, if your data is very clean and well organized, the Odo package might be an option.

2) The chunk-by-chunk approach above is really time-consuming; a ~6 GB table takes more than 24 hours because pandas is slow.

3) If you are not tied to SQLite, MySQL with "LOAD DATA INFILE" is a good option. You can find out how to use it with an internet search. As far as I tested, it is a much more reliable and efficient way, and you can always convert to SQLite later if you really need it. In particular, if the data has many empty strings and datetime columns that need to be converted to a datetime type, I would definitely go with MySQL.
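For completeness, a rough sketch of the MySQL route with the mysql-connector-python package might look like this (the connection details, table, and CSV layout are all assumptions, the target table must already exist, and local_infile has to be enabled on the server):

import mysql.connector

# Credentials, database, table, and CSV layout below are assumptions.
cnx = mysql.connector.connect(user='user', password='password',
                              host='localhost', database='mydb',
                              allow_local_infile=True)
cursor = cnx.cursor()

# The target table must already exist and match the CSV columns.
cursor.execute(r"""
    LOAD DATA LOCAL INFILE 'nyc311calls.csv'
    INTO TABLE nyc311calls
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
""")
cnx.commit()

cursor.close()
cnx.close()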