Minimize the sum of errors of representative integers


Although with help from the community you have managed to state your problem in a form which is understandable mathematically, you still do not supply enough information to enable me (or anyone else) to give you a definitive answer (I would have posted this as a comment, but for some reason the "add comment" option is not available to me). In order to give a good answer to this problem, we need to know the relative sizes of n and k versus each other and 10000 (or the expected maximum of the D_i if it isn't 10000), and whether it is critical that you attain the exact minimum (even if this requires an exorbitant amount of time for the calculation) or if a close approximation would be OK (and if so, how close you need to get). In addition, in order to know which algorithm runs in minimum time, we need some understanding of what kind of hardware is going to run the algorithm (i.e., do we have m CPU cores to run on in parallel, and what is the size of m relative to k?).

Another important piece of information is whether this problem will be solved only once, or whether it must be solved many times with some connection between the distributions of the integers D_i from one problem to the next (e.g., the integers D_i are all random samples from a particular, unchanging probability distribution, or perhaps each successive problem has as its input an ever-growing set which is the set from the previous problem plus an extra s integers).

No reasonable algorithm for your problem should run in time which depends more than linearly on n, since building a histogram of the n integers D_i requires O(n) time, and the answer to the optimization problem itself depends only on the histogram of the integers, not on their ordering. On the other hand, no algorithm can run in time less than O(n), since that is the size of the input of the problem.

A brute-force search over all of the possibilities requires (assuming that at least one of the D_i is 0 and another is 10000), for small k, say k < 10, approximately O(10000^(k-2)) time, so if log10(n) >> 4(k-2), this is the optimal algorithm (since in this case the time for the brute-force search is insignificant compared to the time to read the input). It is also interesting to note that if k is very close to 10000, then a brute-force search only requires O(10000^(10002-k)) time (because we can search instead over the integers which are not used as representative integers).
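To make the brute-force option concrete, here is a minimal Python sketch (function names are mine, not from the question). It assumes each D_i is represented by the largest representative not exceeding it, so the smallest representative can be fixed at min(D) and the remaining k−1 are enumerated over the distinct data values:

```python
from itertools import combinations

def total_error(data, reps):
    # Each value is charged the distance down to the largest
    # representative <= it (reps must include a value <= min(data)).
    reps = sorted(reps)
    return sum(x - max(r for r in reps if r <= x) for x in data)

def brute_force(data, k):
    # Fix the smallest representative at min(data); enumerate every
    # (k-1)-subset of the remaining distinct values and keep the best.
    lo = min(data)
    candidates = sorted(set(data) - {lo})
    best = None
    for combo in combinations(candidates, k - 1):
        e = total_error(data, (lo,) + combo)
        if best is None or e < best[0]:
            best = (e, (lo,) + combo)
    return best
```

This is exact but exponential in k, which is exactly why it is only attractive when k is very small (or, by searching the complement, very close to the maximum).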

If you update the definition of the problem with more information, I will attempt to edit my answer in turn.


Now that the question is clarified, we observe that the R_i divide the D_x into k-1 intervals, [R_1, R_2), [R_2, R_3), ..., [R_{k-1}, R_k). Every D_x belongs to exactly one of those intervals. Let q_i be the number of D_x in the interval [R_i, R_{i+1}), and let s_i be the sum of those D_x. Then each error(R_i) expression is the sum of q_i terms and evaluates to s_i - q_i*R_i.

Summing that over all i, we get a total error of S - sum(q_i*R_i), where S is the sum of all the D_x. So the problem is to choose the R_i to maximize sum(q_i*R_i). Remember each q_i is the number of original data values at least as large as R_i, but smaller than the next representative.
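A quick sanity check of that identity, as a Python sketch (my own helper name; it assumes, per the interval definition above, that the last representative is a sentinel strictly greater than every data value):

```python
def total_error_two_ways(data, reps):
    # reps is sorted; reps[-1] is a sentinel > max(data), so only
    # reps[:-1] actually represent values.
    reps = sorted(reps)
    S = sum(data)
    # Direct sum of per-value errors: x minus its interval's left end.
    direct = sum(x - max(r for r in reps[:-1] if r <= x) for x in data)
    # Via the identity: S - sum(q_i * R_i), where q_i counts the
    # data values falling in [reps[i], reps[i+1]).
    q = [sum(lo <= x < hi for x in data) for lo, hi in zip(reps, reps[1:])]
    identity = S - sum(qi * ri for qi, ri in zip(q, reps))
    return direct, identity
```

Both computations should agree on any input satisfying the sentinel assumption.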

Any global maximum must be a local maximum, so we imagine increasing or decreasing one of the R_i. If R_i is not one of the original data values, then we can increase it without changing any of the q_i and thereby improve our target function. So an optimal solution has each R_i (except the limiting last one) equal to one of the data values. I got a bit bogged down in the math after that, but it seems a sensible approach is to pick the initial R_i as every (n/k)-th data value (simple percentiles), then iteratively see whether moving an R_i to the previous or next data value improves the score and thus decreases the error. (The sum(q_i*R_i) form seems easier to work with, since you can read the data, count repetitions, and update q_i and R_i by looking at only a single data/count point. You only need to store an array of 10,000 data counts, no matter how huge the data set.)

data:      1  3  7  8 14 30
count:     1  2  1  1  3  1      sum(data) = 94
initial R: 1  3  8 14 31
initial Q: 1  3  1  4            sum(QR) = 74 (hence error = 20)

In this example, we could try changing the 3 or the 8 to a 7. For example, if we increase the 3 to 7, then we see there are two 3's in the initial data, so the first two Q's become 1+2 and 3-2; it turns out this leaves sum(QR) unchanged (1+9 = 3+7 = 10), so that move is no improvement. Decreasing the 8 to 7 instead gives Q = 1, 2, 2, 4 and sum(QR) = 1 + 6 + 14 + 56 = 77, reducing the error to 17. I'm sure there are smarter patterns to detect which changes in the QR table are viable, but this seems workable.
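Putting the whole heuristic together — histogram, percentile initialization, and greedy moves of each representative to the previous or next data value — here is a rough Python sketch (names are mine; it finds a local optimum only, as the discussion above anticipates):

```python
def score(values, counts, idxs):
    # sum(q_i * R_i): representatives are values[i] for i in idxs;
    # q_i counts the data in [values[idxs[j]], values[idxs[j+1]]).
    total = 0
    for j, i in enumerate(idxs):
        hi = idxs[j + 1] if j + 1 < len(idxs) else len(values)
        total += sum(counts[i:hi]) * values[i]
    return total

def local_search(values, counts, k):
    # values: sorted distinct data values; counts: their multiplicities.
    # Start with k representatives at (roughly) even percentile
    # positions, then greedily move any one of them one data value
    # left or right while that raises sum(q_i * R_i).
    idxs = sorted({min(j * len(values) // k, len(values) - 1) for j in range(k)})
    best = score(values, counts, idxs)
    improved = True
    while improved:
        improved = False
        for j in range(len(idxs)):
            for step in (-1, 1):
                t = idxs[j] + step
                if not 0 <= t < len(values):
                    continue
                cand = sorted(set(idxs[:j] + idxs[j + 1:] + [t]))
                if cand[0] != 0 or len(cand) != len(idxs):
                    continue  # keep min(data) covered; don't merge reps
                s = score(values, counts, cand)
                if s > best:
                    best, idxs, improved = s, cand, True
    return best, [values[i] for i in idxs]
```

On the worked example (values 1, 3, 7, 8, 14, 30 with counts 1, 2, 1, 1, 3, 1 and k = 4 effective representatives) this converges to sum(QR) = 77, i.e. error 17 — a local optimum, since representatives 1, 7, 14, 30 achieve error 5; random restarts or smarter moves may be worthwhile.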


If the distribution is near random and the sample size n is large enough, you are generally wasting time trying to optimize: the real cost in computation buys only diminishing improvements over the expected average. The fastest average solution is to set the lower k-1 representatives at the low ends of intervals of size M/(k-1), where M is the least upper bound minus the greatest lower bound (i.e., M = maximum possible value - 0), and the last, k-th representative at M+1. It would take order k time (the best we can do with the information presented in this problem) to figure those values out. Stating what I just did is not a proof, of course.
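That heuristic is trivial to write down. A sketch under my reading of the paragraph above (the function name is mine): the k-1 low ends of equal intervals over [0, M], plus M+1 as the final sentinel representative:

```python
def evenly_spaced_reps(M, k):
    # Low ends of k-1 equal intervals covering [0, M], plus M+1 as the
    # final (sentinel) representative. Runs in O(k), independent of n.
    return [i * M // (k - 1) for i in range(k - 1)] + [M + 1]
```

No data is examined at all, which is exactly why this only makes sense when the distribution is close to uniform.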

My point is this: the above discussion is one simplification that I think is very practical for one large class of sets. At the other end, it is straightforward to compute the error for every possible choice of representatives and then select the smallest; the running time makes that solution intractable in many cases. That the person asking this question expects more than the most direct and exact (intractable) answer leaves much that is open-ended. We can trim at the edges from here to eternity, trying to quantify all sorts of properties along the infinite solution space of all possible permutations (or combinations) of n numbers and all k values.