Maintaining a ratio when splitting up data in python function

numpy


Here's an algorithm that I think might work for you.

You take the test_length and train_length and divide by their GCD to get the ratio as a simple fraction. You take the numerator and denominator and you add them together, and that is the size factor for your groups.

For example if the ratio is 3:2, the size of each group must be a multiple of 5.
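As a concrete illustration of that step, the size factor can be computed with `math.gcd` (the variable names here are mine, not from the answer's code):

```python
from math import gcd

test_length, train_length = 30, 20        # a 3:2 ratio
divisor = gcd(test_length, train_length)  # 10
size_factor = test_length // divisor + train_length // divisor
print(size_factor)  # 5 -> every group size must be a multiple of 5
```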

You then take the total_length and divide it by the number of folds to get the ideal size for the first group, which may well be a floating point number. You find the largest multiple of 5 that is less than or equal to that, and that is your first group.

Subtract that value from your total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract that from the total, and continue until you have calculated all the groups.

Some example code:

from math import gcd

total_length = test_length + train_length
divisor = gcd(test_length, train_length)
test_multiple = test_length // divisor
train_multiple = train_length // divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length // total_multiple < folds:
    total_multiple = total_length // folds
    test_multiple = int(round(test_length * total_multiple / total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds, 0, -1):
    float_size = total_length / i
    int_size = int(float_size / total_multiple) * total_multiple
    test_size = int_size * test_multiple // total_multiple
    train_size = int_size * train_multiple // total_multiple
    test_length -= test_size    # keep track of the test data used
    train_length -= train_size  # keep track of the train data used
    total_length -= int_size
    groups.append((test_size, train_size))

# If test_length or train_length has gone negative, we need to adjust the
# groups to "give back" some of the data.
distribute_overrun(groups, test_length, 0)
distribute_overrun(groups, train_length, 1)

This has been updated to keep track of the size used from each group (test and train) but not worry if we use too much initially.

Then at the end, if there's any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back by decrementing the appropriate side of the ratio in as many groups as it takes to bring the overrun back to zero.

The distribute_overrun function is included below.

def distribute_overrun(groups, overrun, part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

At the end of that, groups will be a list of tuples containing the test_size and train_size for each group.
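To make the walkthrough concrete, here is a self-contained sketch of the core loop on a 2:3 ratio split into 4 folds. The function name `split_groups` is mine, and the not-enough-data adjustment and overrun redistribution are omitted for brevity:

```python
from math import gcd

def split_groups(test_length, train_length, folds):
    # Reduce the ratio to lowest terms; each group's size must be a
    # multiple of the sum of the reduced numerator and denominator.
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    groups = []
    for i in range(folds, 0, -1):
        float_size = total_length / i       # ideal size of the next group
        # Largest multiple of total_multiple not exceeding the ideal size
        int_size = int(float_size / total_multiple) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        total_length -= int_size
        groups.append((test_size, train_size))
    return groups

print(split_groups(20, 30, 4))  # [(4, 6), (4, 6), (6, 9), (6, 9)]
```

Note that every group keeps the exact 2:3 test:train ratio and the sizes sum back to the original 20 and 30.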

If that sounds like the sort of thing you want, but you need me to expand on the code example, just let me know.


In another question the author wanted to do a cross-validation similar to yours. Please take a look at this answer. Working out that answer for your problem, it would look like:

import numpy as np

# in train_data the first line is used for the cross-validation,
# and the other lines will follow, so you can add as many lines as you want
test_data = np.array([0., 1., 2., 3., 4., 5.])
train_data = np.array([[0.09, 1.9, 1.1, 1.5, 4.2, 3.1, 5.1],
                       [   3,    4, 3.1,  10,  20,    2,   3]])

def cross_validation_group(test_data, train_data):
    om1, om2 = np.meshgrid(test_data, train_data[0])
    dist = (om1 - om2)**2
    indexes = np.argsort(dist, axis=0)
    return train_data[:, indexes[0]]

print(cross_validation_group(test_data, train_data))
# array([[ 0.09,  1.1 ,  1.9 ,  3.1 ,  4.2 ,  5.1 ],
#        [ 3.  ,  3.1 ,  4.  ,  2.  , 20.  ,  3.  ]])

You will have the train_data corresponding to the interval defined in test_data.
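The same nearest-neighbour matching can also be written without meshgrid, using broadcasting and np.argmin. This is a sketch of an equivalent formulation under the same data, not the answer's original code:

```python
import numpy as np

test_data = np.array([0., 1., 2., 3., 4., 5.])
train_data = np.array([[0.09, 1.9, 1.1, 1.5, 4.2, 3.1, 5.1],
                       [   3,    4, 3.1,  10,  20,    2,   3]])

# For each test point, the index of the closest value in train_data's
# first row; broadcasting builds the full (6, 7) distance table.
idx = np.argmin(np.abs(test_data[:, None] - train_data[0][None, :]), axis=1)
matched = train_data[:, idx]
print(matched)
# [[ 0.09  1.1   1.9   3.1   4.2   5.1 ]
#  [ 3.    3.1   4.    2.   20.    3.  ]]
```

The result matches the meshgrid version, column for column.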