Pickle File too large to load
Looks like you're in a bit of a pickle! ;-). Hopefully after this, you'll NEVER USE PICKLE EVER. It's just not a very good data storage format.
Anyways, for this answer I'm assuming your `Document` class looks a bit like this. If not, comment with your actual `Document` class:

```python
class Document(object):  # <-- the object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text):  # assuming all strings
        self.title = title
        self.date = date
        self.text = text
```
Anyways, I made some simple test data with this class:
```python
d = [Document(title='foo', text='foo is good', date='1/1/1'),
     Document(title='bar', text='bar is better', date='2/2/2'),
     Document(title='baz', text='no one likes baz :(', date='3/3/3')]
```
Pickled it with format 2 (`pickle.HIGHEST_PROTOCOL` for Python 2.x):

```python
>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
```
And disassembled it with `pickletools`:

```
>>> pickletools.dis(s)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: c        GLOBAL     '__main__ Document'
   25: q        BINPUT     1
   27: )        EMPTY_TUPLE
   28: \x81     NEWOBJ
   29: q        BINPUT     2
   31: }        EMPTY_DICT
   32: q        BINPUT     3
   34: (        MARK
   35: U            SHORT_BINSTRING 'date'
   41: q            BINPUT     4
   43: U            SHORT_BINSTRING '1/1/1'
   50: q            BINPUT     5
   52: U            SHORT_BINSTRING 'text'
   58: q            BINPUT     6
   60: U            SHORT_BINSTRING 'foo is good'
   73: q            BINPUT     7
   75: U            SHORT_BINSTRING 'title'
   82: q            BINPUT     8
   84: U            SHORT_BINSTRING 'foo'
   89: q            BINPUT     9
   91: u            SETITEMS   (MARK at 34)
   92: b        BUILD
   93: h        BINGET     1
   95: )        EMPTY_TUPLE
   96: \x81     NEWOBJ
   97: q        BINPUT     10
   99: }        EMPTY_DICT
  100: q        BINPUT     11
  102: (        MARK
  103: h            BINGET     4
  105: U            SHORT_BINSTRING '2/2/2'
  112: q            BINPUT     12
  114: h            BINGET     6
  116: U            SHORT_BINSTRING 'bar is better'
  131: q            BINPUT     13
  133: h            BINGET     8
  135: U            SHORT_BINSTRING 'bar'
  140: q            BINPUT     14
  142: u            SETITEMS   (MARK at 102)
  143: b        BUILD
  144: h        BINGET     1
  146: )        EMPTY_TUPLE
  147: \x81     NEWOBJ
  148: q        BINPUT     15
  150: }        EMPTY_DICT
  151: q        BINPUT     16
  153: (        MARK
  154: h            BINGET     4
  156: U            SHORT_BINSTRING '3/3/3'
  163: q            BINPUT     17
  165: h            BINGET     6
  167: U            SHORT_BINSTRING 'no one likes baz :('
  188: q            BINPUT     18
  190: h            BINGET     8
  192: U            SHORT_BINSTRING 'baz'
  197: q            BINPUT     19
  199: u            SETITEMS   (MARK at 153)
  200: b        BUILD
  201: e    APPENDS    (MARK at 5)
  202: .    STOP
```
Looks complex! But really, it's not so bad. `pickle` is basically a stack machine: each ALL_CAPS identifier you see is an opcode, which manipulates the internal "stack" in some way during decoding. If we were trying to parse some complex structure, this would matter more, but luckily we're just making a simple list of essentially-tuples. All this "code" is doing is constructing a bunch of objects on the stack, and then pushing the entire stack into a list.
The one thing we DO need to care about is the `BINPUT` / `BINGET` opcodes you see scattered around. Basically, these are for memoization, to reduce the data footprint: `pickle` saves strings with `BINPUT <id>`, and then, if they come up again, instead of re-dumping them it simply emits a `BINGET <id>` to retrieve them from the cache.
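You can watch the memoization happen yourself: pickle the same string object twice and the disassembly shows a `BINPUT` for the first occurrence and a `BINGET` for the second. (A quick demo in Python 3 syntax; the opcode names are the same as in Python 2.)

```python
import io
import pickle
import pickletools

s = 'spam'
data = pickle.dumps([s, s], 2)  # the same object twice -> memo hit

out = io.StringIO()
pickletools.dis(data, out)
listing = out.getvalue()

# the first occurrence is stored with BINPUT, the second fetched with BINGET
print('BINPUT' in listing, 'BINGET' in listing)
```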
Also, another complication! There's more than just `SHORT_BINSTRING` - there's the normal `BINSTRING` for strings longer than 255 bytes, and also some fun unicode variants as well. I'll just assume that you're using Python 2 with all-ASCII strings. Again, comment if this isn't a correct assumption.
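The length prefixes for the two string opcodes are easy to decode with `struct`: `SHORT_BINSTRING` uses a single unsigned byte, while `BINSTRING` uses a 4-byte little-endian integer. A minimal sketch in Python 3 syntax, operating on hand-built opcode payloads rather than a real Python 2 pickle:

```python
import struct

# hand-built payloads following the two string opcodes
short_payload = b'\x04date'                        # SHORT_BINSTRING: 1-byte length, then data
long_payload = b'\x14\x05\x00\x00' + b'x' * 1300   # BINSTRING: 4-byte little-endian length

short_len = struct.unpack('B', short_payload[:1])[0]   # unsigned byte -> 4
long_len = struct.unpack('<i', long_payload[:4])[0]    # little-endian int -> 1300

print(short_len, short_payload[1:1 + short_len])
print(long_len)
```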
OK, so we need to stream the file until we hit a '\x81' byte (`NEWOBJ`). Then, we need to scan forward until we hit a '(' (`MARK`) character. Then, until we hit a 'u' (`SETITEMS`), we read pairs of key/value strings - there should be 3 pairs total, one for each field.
So, let's do this. Here's my script to read pickle data in a streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.
```python
# test data: same as before, but with the 'bar' text replaced by a 1300-byte
# BINSTRING ('bar is better' * 100) to exercise the long-string code path
pickledata = ('\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03'
              '(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07'
              'U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0c'
              'h\x06T\x14\x05\x00\x00' + 'bar is better' * 100 +
              'q\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11'
              'h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.')

# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)

import pickle  # just for the opcode names
import struct  # binary unpacking


def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1)  # not a memo opcode, rewind


def try_read_string(f, opcode, cache):
    if opcode in [pickle.SHORT_BINSTRING, pickle.BINSTRING]:
        # SHORT_BINSTRING: 1-byte unsigned length; BINSTRING: 4-byte little-endian
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else '<i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + opcode + ' at pos ' + str(f.tell()))


memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        # scan forward to the field instantiation
        while picklefile.read(1) != pickle.MARK:
            pass
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields  # insert into sqlite here
    elif c == pickle.STOP:
        break
```
This correctly reads my test data in pickle format 2 (modified so the 'bar' text is a long string):

```
$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is better...bar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
```

(Output truncated here - the 'bar' text actually repeats 'bar is better' 100 times.)
Good luck!
You didn't pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (`open(..., 'wb')` truncates the output file) and re-wrote all of the data again. Additionally, if your program ever stopped and then restarted with new input data, the old output data was lost.
I do not know why `objects` didn't cause an out-of-memory error while you were pickling, because it grew to the same size as the object that `pickle.load()` wants to create.
Here is how you could have created the pickle file incrementally:
```python
def save_objects(objects):
    with open('objects.pkl', 'ab') as output:  # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    # objects = []   <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            ...
            save_objects(article)
```
Then you could have incrementally read the pickle file like so:
```python
import pickle

with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print article
    except EOFError:
        pass
```
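Putting both halves together, here is a self-contained round trip of the append/stream pattern (Python 3 syntax; the file path and the record contents are just placeholders for the demo):

```python
import os
import pickle
import tempfile

records = [{'title': 'foo'}, {'title': 'bar'}, {'title': 'baz'}]

path = os.path.join(tempfile.mkdtemp(), 'objects.pkl')

# append one record per call: three independent pickles end up in one file
for record in records:
    with open(path, 'ab') as output:
        pickle.dump(record, output, pickle.HIGHEST_PROTOCOL)

# stream them back one at a time; memory use stays at one record
loaded = []
with open(path, 'rb') as pickle_file:
    try:
        while True:
            loaded.append(pickle.load(pickle_file))
    except EOFError:
        pass

print(loaded == records)  # True
```

The key point is that concatenated pickles are themselves valid: `pickle.load()` stops at each `STOP` opcode, so repeated calls walk through the file record by record until `EOFError`.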
The choices I can think of are:
- Try cPickle. It might help.
- Try streaming-pickle.
- Read your pickle file in a 64-bit environment with lots and lots of RAM.
- Re-crawl the original data, this time actually storing it incrementally, or storing it in a database. Without the inefficiency of constantly re-writing your pickle output file, your crawling might go significantly faster this time.
I had a very similar case recently - an 11 GB pickle. I did not try to load it incrementally on my machine, as I didn't have enough time to implement my own incremental loader or to refine the existing ones for my case.

What I did instead was start up a big instance with enough memory at a cloud hosting provider (the price is not high if you run it only for a short time, like a few hours), upload the file to that server over SSH (SCP), and simply load it on that instance to analyze it there and re-write it into a more suitable format.
Not a programming solution, but time-effective (low effort).
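That "re-write it into a more suitable format" step can be as simple as converting each record to a line of JSON, which any machine can then stream without loading everything at once. A sketch, assuming the pickle holds a list of dicts (the filenames and record shape here are made up for the demo):

```python
import json
import os
import pickle
import tempfile

tmp = tempfile.mkdtemp()
pkl_path = os.path.join(tmp, 'big.pkl')
jsonl_path = os.path.join(tmp, 'big.jsonl')

# stand-in for the huge pickle: a list of dicts
with open(pkl_path, 'wb') as f:
    pickle.dump([{'id': i, 'text': 'doc %d' % i} for i in range(3)], f)

# one-time conversion on the big machine: pickle -> JSON Lines
with open(pkl_path, 'rb') as f:
    documents = pickle.load(f)  # this is the step that needs all the RAM
with open(jsonl_path, 'w') as f:
    for doc in documents:
        f.write(json.dumps(doc) + '\n')

# from now on, anything can stream it record by record
with open(jsonl_path) as f:
    streamed = [json.loads(line) for line in f]

print(streamed == documents)  # True
```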