Maintaining a ratio when splitting up data in python function

numpy


Here's an algorithm that I think might work for you.

You take the test_length and train_length and divide by their GCD to get the ratio as a simple fraction. You take the numerator and denominator and you add them together, and that is the size factor for your groups.

For example if the ratio is 3:2, the size of each group must be a multiple of 5.
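As a concrete illustration of that step, the size factor can be computed with `math.gcd` (the variable names here are mine, not from the answer's code):

```python
from math import gcd

test_length, train_length = 30, 20        # a 3:2 ratio
divisor = gcd(test_length, train_length)  # 10
size_factor = test_length // divisor + train_length // divisor
print(size_factor)  # 5 -> every group size must be a multiple of 5
```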

You then take the total_length and divide it by the number of folds to get the ideal size for the first group, which may well be a floating point number. You find the largest multiple of 5 that is less than or equal to that, and that is your first group.

Subtract that value from your total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract that from the total, and continue until you have calculated all the groups.

Some example code:

from math import gcd

total_length = test_length + train_length
divisor = gcd(test_length, train_length)
test_multiple = test_length // divisor
train_multiple = train_length // divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length // total_multiple < folds:
    total_multiple = total_length // folds
    test_multiple = int(round(test_length * total_multiple / total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds, 0, -1):
    float_size = total_length / i
    int_size = int(float_size / total_multiple) * total_multiple
    test_size = int_size * test_multiple // total_multiple
    train_size = int_size * train_multiple // total_multiple
    test_length -= test_size    # keep track of the test data used
    train_length -= train_size  # keep track of the train data used
    total_length -= int_size
    groups.append((test_size, train_size))

# If test_length or train_length has gone negative, we need to adjust the
# groups to "give back" some of the data.
distribute_overrun(groups, test_length, 0)
distribute_overrun(groups, train_length, 1)

This has been updated to keep track of the size used from each group (test and train) but not worry if we use too much initially.

Then at the end, if there's any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back by decrementing the appropriate side of the ratio in as many groups as it takes to bring the overrun back to zero.

The distribute_overrun function is included below.

def distribute_overrun(groups, overrun, part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

At the end of that, groups will be a list of tuples containing the test_size and train_size for each group.
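To make the walkthrough concrete, here is a self-contained sketch of the core loop on a 2:3 ratio split into 4 folds. The function name `split_groups` is mine, and the not-enough-data adjustment and overrun redistribution are omitted for brevity:

```python
from math import gcd

def split_groups(test_length, train_length, folds):
    # Reduce the ratio to lowest terms; each group's size must be a
    # multiple of the sum of the reduced numerator and denominator.
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    groups = []
    for i in range(folds, 0, -1):
        float_size = total_length / i       # ideal size of the next group
        # Largest multiple of total_multiple not exceeding the ideal size
        int_size = int(float_size / total_multiple) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        total_length -= int_size
        groups.append((test_size, train_size))
    return groups

print(split_groups(20, 30, 4))  # [(4, 6), (4, 6), (6, 9), (6, 9)]
```

Note that every group keeps the exact 2:3 test:train ratio and the sizes sum back to the original 20 and 30.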

If that sounds like the sort of thing you want, but you need me to expand on the code example, just let me know.


In another question the author wanted to do a cross-validation similar to yours. Please take a look at this answer. Working out that answer for your problem, it would look like:

import numpy as np

# in train_data the first line is used for the cross-validation,
# and the other lines will follow, so you can add as many lines as you want
test_data = np.array([0., 1., 2., 3., 4., 5.])
train_data = np.array([[0.09, 1.9, 1.1, 1.5, 4.2, 3.1, 5.1],
                       [   3,    4, 3.1,  10,  20,    2,   3]])

def cross_validation_group(test_data, train_data):
    om1, om2 = np.meshgrid(test_data, train_data[0])
    dist = (om1 - om2)**2
    indexes = np.argsort(dist, axis=0)
    return train_data[:, indexes[0]]

print(cross_validation_group(test_data, train_data))
# array([[ 0.09,  1.1 ,  1.9 ,  3.1 ,  4.2 ,  5.1 ],
#        [ 3.  ,  3.1 ,  4.  ,  2.  , 20.  ,  3.  ]])

You will have the train_data corresponding to the interval defined in test_data.
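The same nearest-neighbour matching can also be written without meshgrid, using broadcasting and np.argmin. This is a sketch of an equivalent formulation under the same data, not the answer's original code:

```python
import numpy as np

test_data = np.array([0., 1., 2., 3., 4., 5.])
train_data = np.array([[0.09, 1.9, 1.1, 1.5, 4.2, 3.1, 5.1],
                       [   3,    4, 3.1,  10,  20,    2,   3]])

# For each test point, the index of the closest value in train_data's
# first row; broadcasting builds the full (6, 7) distance table.
idx = np.argmin(np.abs(test_data[:, None] - train_data[0][None, :]), axis=1)
matched = train_data[:, idx]
print(matched)
# [[ 0.09  1.1   1.9   3.1   4.2   5.1 ]
#  [ 3.    3.1   4.    2.   20.    3.  ]]
```

The result matches the meshgrid version, column for column.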