Python: Pre-loading memory
This could be an XY problem, the source of which is the assumption that you must use pickles at all. Pickles are awful to deal with because of how they manage dependencies, and that makes them a fundamentally poor choice for any long-term data storage.
The source financial data is almost certainly in some tabular form to begin with, so it may be possible to request it in a friendlier format.
In the meantime, a simple middleware that deserializes the pickles and reserializes them in the new format will smooth the transition:
input -> load pickle -> write -> output
Converting your workflow to use Parquet or Feather, which are designed to be efficient to read and write, will almost certainly make a considerable difference to your load speed.
Further relevant links
- Answer to How to reversibly store and load a Pandas dataframe to/from disk
- What are the pros and cons of parquet format compared to other formats?
You may also be able to achieve this with hickle, which internally uses the HDF5 format, ideally making it significantly faster than pickle while still behaving like one.
An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.
```shell
sudo mkdir /mnt/pickle
sudo mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl /mnt/pickle/pickle.pkl
```
Then you can access the pickle at /mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk read is not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try turning down the size of the ramdisk (I set it to 1536 MB, or 1.5 GB).
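To check whether disk reads really are your bottleneck before setting up the ramdisk, you can time the load itself. A standard-library sketch, using a temporary file as a stand-in for the real paths (swap in your pickle and /mnt/pickle/pickle.pkl to compare the two):

```python
import os
import pickle
import tempfile
import time

# Stand-in payload; in practice you would time your existing pickle file.
payload = {"rows": list(range(100_000))}
path = os.path.join(tempfile.mkdtemp(), "pickle.pkl")
with open(path, "wb") as f:
    pickle.dump(payload, f)

start = time.perf_counter()
with open(path, "rb") as f:
    data = pickle.load(f)
elapsed = time.perf_counter() - start
print(f"load took {elapsed:.4f}s")
```

If the timing barely changes between disk and ramdisk, the overhead is in deserialization rather than I/O, and the ramdisk will not help.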
You can use a ShareableList: one Python program loads the file and keeps the data in memory, while a second Python program reads it from that shared memory. Whatever your data is, you can load it into a dictionary, dump it as JSON, and later reload the JSON.
Program 1:
```python
import pickle
import json
from multiprocessing.managers import SharedMemoryManager

YOUR_DATA = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict = {'DATA': YOUR_DATA}
data_dict_json = json.dumps(data_dict)

smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha', 'beta', data_dict_json])
print(sl)
# smm.shutdown()  # commented out for now, but you will need to call it eventually
```
The output will look like this
```
ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')
```
Now in Program 2:
```python
from multiprocessing import shared_memory

load_from_mem = shared_memory.ShareableList(name='psm_12abcd')
load_from_mem[1]
# OUTPUT: 'beta'
load_from_mem[2]
# OUTPUT: your data in JSON format
```
You can read more here: https://docs.python.org/3/library/multiprocessing.shared_memory.html
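The two programs above can be combined into one self-contained, standard-library sketch (single process here for illustration; a plain dict stands in for the unpickled data, and the segment name is taken from the manager rather than hard-coded, since names like 'psm_12abcd' are assigned at runtime):

```python
import json
from multiprocessing import shared_memory
from multiprocessing.managers import SharedMemoryManager

# "Program 1": load data (a stand-in dict here) and publish it as JSON.
your_data = {"DATA": [1, 2, 3]}
payload = json.dumps(your_data)

smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(["alpha", "beta", payload])
name = sl.shm.name  # pass this name to the reader process

# "Program 2": attach to the same list by name and decode the payload.
reader = shared_memory.ShareableList(name=name)
recovered = json.loads(reader[2])

# Clean up: detach the reader, then let the manager free the memory.
reader.shm.close()
smm.shutdown()
```

Note that the shared data lives only as long as the manager process: once `smm.shutdown()` runs, readers can no longer attach.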