Pickle File too large to load


Looks like you're in a bit of a pickle! ;-). Hopefully after this, you'll NEVER USE PICKLE EVER. It's just not a very good data storage format.

Anyways, for this answer I'm assuming your Document class looks a bit like this. If not, comment with your actual Document class:

class Document(object): # <-- object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text

Anyways, I made some simple test data with this class:

d = [Document(title='foo', text='foo is good', date='1/1/1'),
     Document(title='bar', text='bar is better', date='2/2/2'),
     Document(title='baz', text='no one likes baz :(', date='3/3/3')]

Pickled it with protocol 2 (pickle.HIGHEST_PROTOCOL for Python 2.x):

>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'

And disassembled it with pickletools:

>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e        APPENDS    (MARK at 5)
  202: .    STOP

Looks complex! But really, it's not so bad. pickle is basically a stack machine: each ALL_CAPS identifier you see is an opcode, which manipulates the internal "stack" in some way during decoding. If we were trying to parse some complex structure, this would matter more, but luckily we're just making a simple list of essentially-tuples. All this "code" is doing is constructing a bunch of objects on the stack, and then pushing the entire stack into a list.

The one thing we DO need to care about is the BINPUT / BINGET opcodes you see scattered around. Basically, these are for 'memoization': to reduce the data footprint, pickle saves a string once with BINPUT <id>, and if it comes up again, instead of re-dumping it, it simply emits a BINGET <id> to retrieve it from the cache.
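A quick illustration of the memo (my own example, not part of the original test data): pickling a list that contains the same interned string twice shows the second occurrence being fetched with a get opcode instead of being written out again:

import pickle
import pickletools

# the second 'spam' is the same (interned) string object, so the pickler
# memoizes it on first sight (BINPUT) and refers back to it (BINGET) later
pickletools.dis(pickle.dumps(['spam', 'spam'], 2))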

Also, another complication! There's more than just SHORT_BINSTRING - there's a plain BINSTRING for strings of 256 bytes or more, and also some fun unicode variants as well. I'll just assume that you're using Python 2 with all ASCII strings. Again, comment if this isn't a correct assumption.
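For reference, these opcodes are exposed as one-byte constants on the pickle module itself, which is what the streaming script below compares against:

import pickle

# Python 2 pickle module constants for the string opcodes discussed above
print repr(pickle.SHORT_BINSTRING)  # 'U' - strings shorter than 256 bytes
print repr(pickle.BINSTRING)        # 'T' - longer strings, 4-byte length prefix
print repr(pickle.BINUNICODE)       # 'X' - unicode strings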

OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then, we need to scan forward until we hit a '(' (MARK) character. Then, until we hit a 'u' (SETITEMS), we read pairs of key/value strings - there should be 3 pairs total, one for each field.

So, let's do this. Here's my script to read pickle data in streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.

pickledata = ('\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03'
              '(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07'
              'U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b'
              '(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00'
              + 'bar is better' * 100 +  # 1300-byte BINSTRING payload, matching the \x14\x05\x00\x00 length prefix
              'q\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10'
              '(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12'
              'h\x08U\x03bazq\x13ube.')

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle  # just for opcode names
import struct  # binary unpacking

def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1)  # rewind

def try_read_string(f, opcode, cache):
    if opcode in [pickle.SHORT_BINSTRING, pickle.BINSTRING]:
        # 'B': SHORT_BINSTRING length is an unsigned byte; 'i': BINSTRING has a 4-byte length
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else 'i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + opcode + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass  # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into sqlite here
    elif c == pickle.STOP:
        break

This correctly reads my test data in pickle protocol 2 (modified to have a long string):

$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
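Since the question mentions SQL, here is one way the '# insert into sqlite here' comment could be filled in - a minimal sqlite3 sketch, where the database name, table name and columns are just placeholders for whatever schema you actually want:

import sqlite3

conn = sqlite3.connect('documents.db')  # placeholder database name
conn.execute('CREATE TABLE IF NOT EXISTS documents (title TEXT, date TEXT, text TEXT)')

def insert_document(fields):
    # fields is the dict built by the streaming loop above
    conn.execute('INSERT INTO documents (title, date, text) VALUES (?, ?, ?)',
                 (fields.get('title'), fields.get('date'), fields.get('text')))
    conn.commit()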

Good luck!


You didn't pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(...,'wb') destroys the output file), and re-wrote all of the data again. Additionally, if your program ever stopped and then restarted with new input data, the old output data was lost.
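To make that concrete, here is a sketch of the pattern I assume you had (crawl() is a stand-in for your actual crawling code); note how every iteration truncates the file and re-dumps the whole ever-growing list:

import pickle

def crawl(link):
    return {'url': link.strip(), 'text': 'stub'}  # stand-in for the real crawler

objects = []
with open('links2.txt', 'rb') as infile:
    for link in infile:
        objects.append(crawl(link))
        # 'wb' truncates objects.pkl, so the entire growing list is
        # re-written from scratch on every single iteration
        with open('objects.pkl', 'wb') as output:
            pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)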

I do not know why your objects list didn't cause an out-of-memory error while you were pickling, since it grew to the same size as the object that pickle.load() wants to create.

Here is how you could have created the pickle file incrementally:

def save_objects(objects):
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    # objects = []  <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            ...
            save_objects(article)

Then you could have incrementally read the pickle file like so:

import pickle

with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass
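If you prefer not to nest the loop inside the try, the same idea reads nicely as a small generator (purely a stylistic variation on the loop above):

import pickle

def stream_pickles(path):
    # yield one pickled record at a time instead of holding them all in memory
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

for article in stream_pickles('objects.pkl'):
    print article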

The choices I can think of are:

  • Try cPickle. It might help (a drop-in swap is sketched after this list).
  • Try streaming-pickle
  • Read your pickle file in a 64-bit environment with lots and lots of RAM
  • Re-crawl the original data, this time actually incrementally storing the data, or storing it in a database. Without the inefficiency of constantly re-writing your pickle output file, your crawling might go significantly faster this time.
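For the first option, cPickle is a drop-in replacement for the pure-Python module on Python 2; the swap is a single import:

try:
    import cPickle as pickle  # C implementation of the same interface, much faster
except ImportError:
    import pickle             # fall back to the pure-Python module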


I had a very similar case recently - an 11 GB pickle. I did not try to load it incrementally on my machine, as I didn't have enough time to implement my own incremental loader or refine the existing ones for my case.

What I did was start up a big instance with enough memory at a cloud hosting provider (the price is not much if you only run it for a small amount of time, like a few hours), upload the file over SSH (SCP) to that server, and simply load it on that instance to analyze it there and re-write it into a more suitable format.

Not a programming solution, but time-effective (low effort).