
Why is the curve of my permutation test analysis not smooth?


I suspect this is the classic multiprocessing mistake. Nothing guarantees that the processes finish in the same order in which they were started. This means you cannot be sure that the instruction allResults[timesOfInterestFramesIterIdx, :, :] = oneResult stores the result of process 'timesOfInterestFramesIterIdx' at location 'timesOfInterestFramesIterIdx' in allResults. To make it clearer: say 'timesOfInterestFramesIterIdx' is 2; then you have absolutely no guarantee that oneResult is the output of process 2.
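
Here is a contrived, self-contained sketch of the effect (none of it comes from your script; the sleep times are chosen purely to force later-started workers to finish first):

import time
from multiprocessing import Process, Queue


def worker(queue, idx):
    # later-started workers finish first here, to make the race obvious
    time.sleep((5 - idx) * 0.2)
    queue.put(idx)


if __name__ == '__main__':
    queue = Queue()
    jobs = [Process(target=worker, args=(queue, i)) for i in range(5)]
    for p in jobs:
        p.start()
    # results arrive in completion order, e.g. [4, 3, 2, 1, 0]
    print([queue.get() for _ in jobs])
    for p in jobs:
        p.join()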

I have implemented a very quick fix below. The idea is to track the order in which the processes were launched by adding an extra argument to groupDiffsInParallel, which is stored in the queue along with the result and thereby serves as a process identifier when the results are gathered.

import matplotlib.pyplot as plt
import numpy as np
from multiprocessing import cpu_count, Process, Queue
import matplotlib.pylab as pl


def groupDiffsInParallel(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                         timesOfInterestFramesIter,
                         timesOfInterestFramesIterIdx):
    allResults = np.zeros([nrOfReplicas, nrOfPermuts])  # e.g. 100 x 5000
    for repsPerGroupIdx in range(1, nrOfReplicas + 1):
        for permutIdx in range(nrOfPermuts):
            d1TimeCut = d1[:, 0:int(timesOfInterestFramesIter)]
            d1Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d1Sel = d1TimeCut[d1Idxs, :]
            d1Mean = np.mean(d1Sel.flatten())

            d2TimeCut = d2[:, 0:int(timesOfInterestFramesIter)]
            d2Idxs = np.random.randint(0, nrOfReplicas, size=repsPerGroupIdx)
            d2Sel = d2TimeCut[d2Idxs, :]
            d2Mean = np.mean(d2Sel.flatten())

            diff = d1Mean - d2Mean
            allResults[repsPerGroupIdx - 1, permutIdx] = np.abs(diff)

    # tag the result with the index this process was launched with, so the
    # parent can store it at the right position no matter when it arrives
    queue.put({'allResults': allResults,
               'number': timesOfInterestFramesIterIdx})


def evalDifferences_parallel(d1, d2):
    # d1 and d2 are of size reps x time (e.g. 100 x 801)
    nrOfReplicas = d1.shape[0]
    nrOfFrames = d1.shape[1]
    timesOfInterestNs = [0.25, 0.5, 1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70,
                         80, 90, 100]  # 17 values
    nrOfTimesOfInterest = len(timesOfInterestNs)
    framesPerNs = (nrOfFrames - 1) / 100  # sim time == 100 ns
    timesOfInterestFrames = [x * framesPerNs for x in timesOfInterestNs]
    nrOfPermuts = 5000

    allResults = np.zeros([nrOfTimesOfInterest, nrOfReplicas,
                           nrOfPermuts])  # e.g. 17 x 100 x 5000

    nrOfProcesses = cpu_count()
    print('{} cores available'.format(nrOfProcesses))
    queue = Queue()
    jobs = []
    print('Starting ...')

    # use one process for each time cut
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        p = Process(target=groupDiffsInParallel,
                    args=(queue, d1, d2, nrOfReplicas, nrOfPermuts,
                          timesOfInterestFramesIter,
                          timesOfInterestFramesIterIdx))
        p.start()
        jobs.append(p)
        print('Process {} started work on time "{} ns"'.format(
            timesOfInterestFramesIterIdx,
            timesOfInterestNs[timesOfInterestFramesIterIdx]),
            end='\n', flush=True)

    # collect the results; each result carries its own index under 'number'
    resultdict = {}
    for timesOfInterestFramesIterIdx, timesOfInterestFramesIter \
            in enumerate(timesOfInterestFrames):
        resultdict.update(queue.get())
        allResults[resultdict['number'], :, :] = resultdict['allResults']
        print('Process number {} returned the results.'.format(
            resultdict['number']), end='\n', flush=True)

    # hold the main thread and wait for the child processes to complete,
    # then join their resources back into the main thread
    for proc in jobs:
        proc.join()
    print("All parallel done.")

    allResultsMeanOverPermuts = allResults.mean(axis=2)  # size: 17 x 100

    replicaNumbersToPlot = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30,
                                     40, 50, 60, 70, 80, 90, 100])
    replicaNumbersToPlot -= 1  # zero index!

    colors = pl.cm.jet(np.linspace(0, 1, len(replicaNumbersToPlot)))
    ctr = 0

    f, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
    axId = (1, 0)
    for lineIdx in replicaNumbersToPlot:
        lineData = allResultsMeanOverPermuts[:, lineIdx]
        ax[axId].plot(lineData, ".-", color=colors[ctr], linewidth=0.5,
                      label="nReps=" + str(lineIdx + 1))
        ctr += 1

    ax[axId].set_xticks(range(nrOfTimesOfInterest))
    # careful: this is not the same as plt.xticks!
    ax[axId].set_xticklabels(timesOfInterestNs)
    ax[axId].set_xlabel("simulation length taken into account")
    ax[axId].set_ylabel("average difference between mean values of "
                        + "bootstrapping samples")
    ax[axId].set_xlim([ax[axId].get_xlim()[0],
                       ax[axId].get_xlim()[1] + 1])  # increase x max by 1
    plt.show()


# #### MAIN ####
np.random.seed(83737)  # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)

np.random.seed(52389)  # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)
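
As an aside, if you don't want to track identifiers by hand, multiprocessing.Pool can do the bookkeeping for you: Pool.starmap returns results in the order of the submitted argument list, regardless of completion order. A rough sketch, under the assumption that groupDiffs is a hypothetical variant of groupDiffsInParallel that ends with 'return allResults' instead of queue.put(...):

from multiprocessing import Pool, cpu_count

argLists = [(d1, d2, nrOfReplicas, nrOfPermuts, frames)
            for frames in timesOfInterestFrames]
with Pool(processes=cpu_count()) as pool:
    # starmap preserves the order of argLists, so no 'number' tag is needed
    for idx, oneResult in enumerate(pool.starmap(groupDiffs, argLists)):
        allResults[idx, :, :] = oneResult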

This is the output I get; it clearly shows that the processes return in a different order than the one in which they were started.

20 cores available
Starting ...
Process 0 started work on time "0.25 ns"
Process 1 started work on time "0.5 ns"
Process 2 started work on time "1 ns"
Process 3 started work on time "2 ns"
Process 4 started work on time "3 ns"
Process 5 started work on time "4 ns"
Process 6 started work on time "5 ns"
Process 7 started work on time "10 ns"
Process 8 started work on time "20 ns"
Process 9 started work on time "30 ns"
Process 10 started work on time "40 ns"
Process 11 started work on time "50 ns"
Process 12 started work on time "60 ns"
Process 13 started work on time "70 ns"
Process 14 started work on time "80 ns"
Process 15 started work on time "90 ns"
Process 16 started work on time "100 ns"
Process number 3 returned the results.
Process number 0 returned the results.
Process number 4 returned the results.
Process number 7 returned the results.
Process number 1 returned the results.
Process number 2 returned the results.
Process number 5 returned the results.
Process number 8 returned the results.
Process number 6 returned the results.
Process number 9 returned the results.
Process number 10 returned the results.
Process number 11 returned the results.
Process number 12 returned the results.
Process number 13 returned the results.
Process number 14 returned the results.
Process number 15 returned the results.
Process number 16 returned the results.
All parallel done.

And here is the figure that is produced:

[Figure: correct output]


Not sure if you're still stuck on this issue, but I just ran your code on my machine (MacBook Pro (15-inch, 2018)) in Jupyter 4.4.0 and my graphs are smooth with the exact same seed values you originally posted:

# #### MAIN ####
np.random.seed(83737)  # some number for reproducibility
d1 = np.random.rand(100, 801)
d2 = np.random.rand(100, 801)

np.random.seed(52389)  # if changed to 324235 the peak is gone
evalDifferences_parallel(d1, d2)

[Figure: plot]

Perhaps there's nothing wrong with your code and nothing special about the 324235 seed; you may just need to double-check your module versions, since changes made to the source code in more recent versions could affect your results. For reference, I'm using numpy 1.15.4, matplotlib 3.0.2 and multiprocessing 2.6.2.1.
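
If it helps, here is a quick way to print the relevant versions on your side (note that multiprocessing normally ships with the standard library, so unless you installed the standalone backport its version effectively follows your Python version):

import sys
import numpy
import matplotlib

print('Python    :', sys.version)
print('numpy     :', numpy.__version__)
print('matplotlib:', matplotlib.__version__)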