scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays


As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib
joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmapped data to GridSearchCV under scikit-learn 0.15+.
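
For example, here is a minimal end-to-end sketch; the filename, estimator, and parameter grid are placeholders, and the import paths are the 0.15-era ones:

import numpy as np
from sklearn.externals import joblib
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in later versions
from sklearn.svm import SVC

X_train = np.random.rand(10000, 100)           # stand-in for your real training data
y_train = np.random.randint(0, 2, size=10000)  # stand-in for your real labels

# dump once to disk, then reload as a memory map so that the worker
# processes spawned by n_jobs share the same pages instead of copying
joblib.dump(X_train, '/tmp/X_train.joblib')
X_train = joblib.load('/tmp/X_train.joblib', mmap_mode='r+')

grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, n_jobs=-1)
grid.fit(X_train, y_train)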

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process will be limited to 2GB of address space, which can also cause out-of-memory failures on large arrays.

I just found a bug in numpy.save under Python 3.4, but even with that fixed, the subsequent call to mmap will fail with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).
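
To check whether your Python build is 32-bit or 64-bit:

import struct
import sys

print(struct.calcsize('P') * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True only on a 64-bit Python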

Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently, joblib.Parallel memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file, sometimes triggering "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see whether I can change the default mode in the next version of joblib.
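
In the meantime you can pass mmap_mode explicitly. A minimal sketch, where the worker function and array sizes are made up, and max_nbytes='1M' just forces automatic memmapping of any argument larger than 1MB:

import numpy as np
from joblib import Parallel, delayed

def row_sum(data, i):
    # each worker reads its row from the shared memory map
    return data[i].sum()

data = np.random.rand(2000, 1000)

# mmap_mode='r' avoids the copy-on-write 'c' default discussed above
results = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
    delayed(row_sum)(data, i) for i in range(data.shape[0]))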