Minimize the sum of errors of representative integers


Although with help from the community you have managed to state your problem in a form which is understandable mathematically, you still do not supply enough information to enable me (or anyone else) to give you a definitive answer (I would have posted this as a comment, but for some reason the "add comment" option is not available to me). In order to give a good answer to this problem, we need to know the relative sizes of n and k versus each other and 10000 (or the expected maximum of the D_i if it isn't 10000), and whether it is critical that you attain the exact minimum (even if this requires an exorbitant amount of time for the calculation) or if a close approximation would be OK (and if so, how close you need to get). In addition, in order to know which algorithm runs in minimum time, we need some understanding of what kind of hardware is going to run the algorithm (i.e., do we have m CPU cores to run on in parallel, and what is the size of m relative to k?).

Another important piece of information is whether this problem will be solved only once, or whether it must be solved many times with some connection between the distributions of the integers D_i from one problem to the next (e.g., the integers D_i are all random samples from a particular, unchanging probability distribution, or perhaps each successive problem has as its input an ever-growing set which is the set from the previous problem plus an extra s integers).

No reasonable algorithm for your problem should run in time which depends more than linearly on n, since building a histogram of the n integers D_i requires O(n) time, and the answer to the optimization problem itself depends only on the histogram of the integers, not on their ordering. On the other hand, no algorithm can run in time less than O(n), since that is the size of the input of the problem.

A brute-force search over all of the possibilities requires (assuming that at least one of the D_i is 0 and another is 10000), for small k, say k < 10, approximately O(10000^(k-2)) time, so if log10(n) >> 4(k-2), this is the optimal algorithm (since in this case the time for the brute-force search is insignificant compared to the time to read the input). It is also interesting to note that if k is very close to 10000, then a brute-force search only requires O(10000^(10002-k)) time (because we can search instead over the integers which are not used as representative integers).
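To make the brute-force option concrete, here is a minimal Python sketch (function names are mine, not from the question). It assumes each D_i is represented by the largest representative not exceeding it, so the smallest representative can be fixed at min(D) and the remaining k−1 are enumerated over the distinct data values:

```python
from itertools import combinations

def total_error(data, reps):
    # Each value is charged the distance down to the largest
    # representative <= it (reps must include a value <= min(data)).
    reps = sorted(reps)
    return sum(x - max(r for r in reps if r <= x) for x in data)

def brute_force(data, k):
    # Fix the smallest representative at min(data); enumerate every
    # (k-1)-subset of the remaining distinct values and keep the best.
    lo = min(data)
    candidates = sorted(set(data) - {lo})
    best = None
    for combo in combinations(candidates, k - 1):
        e = total_error(data, (lo,) + combo)
        if best is None or e < best[0]:
            best = (e, (lo,) + combo)
    return best
```

This is exact but exponential in k, which is exactly why it is only attractive when k is very small (or, by searching the complement, very close to the maximum).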

If you update the definition of the problem with more information, I will attempt to edit my answer in turn.


Now that the question is clarified, we observe that the R_i divide the D_x into k-1 intervals, [R_1, R_2), [R_2, R_3), ..., [R_{k-1}, R_k). Every D_x belongs to exactly one of those intervals. Let q_i be the number of D_x in the interval [R_i, R_{i+1}), and let s_i be the sum of those D_x. Then each error(R_i) expression is the sum of q_i terms and evaluates to s_i - q_i*R_i.

Summing that over all i, we get a total error of S - sum(q_i*R_i), where S is the sum of all the D_x. So the problem is to choose the R_i to maximize sum(q_i*R_i). Remember each q_i is the number of original data values at least as large as R_i, but smaller than the next representative.
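A quick sanity check of that identity, as a Python sketch (my own helper name; it assumes, per the interval definition above, that the last representative is a sentinel strictly greater than every data value):

```python
def total_error_two_ways(data, reps):
    # reps is sorted; reps[-1] is a sentinel > max(data), so only
    # reps[:-1] actually represent values.
    reps = sorted(reps)
    S = sum(data)
    # Direct sum of per-value errors: x minus its interval's left end.
    direct = sum(x - max(r for r in reps[:-1] if r <= x) for x in data)
    # Via the identity: S - sum(q_i * R_i), where q_i counts the
    # data values falling in [reps[i], reps[i+1]).
    q = [sum(lo <= x < hi for x in data) for lo, hi in zip(reps, reps[1:])]
    identity = S - sum(qi * ri for qi, ri in zip(q, reps))
    return direct, identity
```

Both computations should agree on any input satisfying the sentinel assumption.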

Any global maximum must be a local maximum, so we imagine increasing or decreasing one of the R_i. If R_i is not one of the original data values, then we can increase it without changing any of the q_i and thereby improve our target function. So an optimal solution has each R_i (except the limiting last one) equal to one of the data values. I got a bit bogged down in the math after that, but it seems a sensible approach is to pick the initial R_i as every (n/k)-th data value (simple percentiles), then iteratively see whether moving an R_i to the previous or next data value improves the score and thus decreases the error. (The sum(q_i*R_i) form seems easier to work with, since you can read the data, count repetitions, and update q_i and R_i by looking at only a single data/count point. You only need to store an array of 10,000 data counts, no matter how huge the data set.)

data:      1  3  7  8 14 30
count:     1  2  1  1  3  1      sum(data) = 94
initial R: 1  3  8 14 31
initial Q: 1  3  1  4            sum(QR) = 74 (hence error = 20)

In this example, we could try changing the 3 or the 8 to a 7. For example, if we increase the 3 to 7, then we see there are two 3's in the initial data, so the first two Q's become 1+2 and 3-2; it turns out this leaves sum(QR) unchanged (1+9 = 3+7 = 10), so that move is no improvement. Decreasing the 8 to 7 instead gives Q = 1, 2, 2, 4 and sum(QR) = 1 + 6 + 14 + 56 = 77, reducing the error to 17. I'm sure there are smarter patterns to detect which changes in the QR table are viable, but this seems workable.
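Putting the whole heuristic together — histogram, percentile initialization, and greedy moves of each representative to the previous or next data value — here is a rough Python sketch (names are mine; it finds a local optimum only, as the discussion above anticipates):

```python
def score(values, counts, idxs):
    # sum(q_i * R_i): representatives are values[i] for i in idxs;
    # q_i counts the data in [values[idxs[j]], values[idxs[j+1]]).
    total = 0
    for j, i in enumerate(idxs):
        hi = idxs[j + 1] if j + 1 < len(idxs) else len(values)
        total += sum(counts[i:hi]) * values[i]
    return total

def local_search(values, counts, k):
    # values: sorted distinct data values; counts: their multiplicities.
    # Start with k representatives at (roughly) even percentile
    # positions, then greedily move any one of them one data value
    # left or right while that raises sum(q_i * R_i).
    idxs = sorted({min(j * len(values) // k, len(values) - 1) for j in range(k)})
    best = score(values, counts, idxs)
    improved = True
    while improved:
        improved = False
        for j in range(len(idxs)):
            for step in (-1, 1):
                t = idxs[j] + step
                if not 0 <= t < len(values):
                    continue
                cand = sorted(set(idxs[:j] + idxs[j + 1:] + [t]))
                if cand[0] != 0 or len(cand) != len(idxs):
                    continue  # keep min(data) covered; don't merge reps
                s = score(values, counts, cand)
                if s > best:
                    best, idxs, improved = s, cand, True
    return best, [values[i] for i in idxs]
```

On the worked example (values 1, 3, 7, 8, 14, 30 with counts 1, 2, 1, 1, 3, 1 and k = 4 effective representatives) this converges to sum(QR) = 77, i.e. error 17 — a local optimum, since representatives 1, 7, 14, 30 achieve error 5; random restarts or smarter moves may be worthwhile.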


If the distribution is near random and the sample size n is large enough, you are generally wasting time trying to optimize: the real cost in computation buys only diminishing improvements over the expected average. The fastest average solution is to set the lower k-1 representatives at the low ends of intervals of size M/(k-1), where M is the least upper bound minus the greatest lower bound (i.e., M = maximum possible value - 0), and the last, k-th representative at M+1. It would take order k time (the best we can do with the information presented in this problem) to figure those values out. Stating what I just did is not a proof, of course.
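That heuristic is trivial to write down. A sketch under my reading of the paragraph above (the function name is mine): the k-1 low ends of equal intervals over [0, M], plus M+1 as the final sentinel representative:

```python
def evenly_spaced_reps(M, k):
    # Low ends of k-1 equal intervals covering [0, M], plus M+1 as the
    # final (sentinel) representative. Runs in O(k), independent of n.
    return [i * M // (k - 1) for i in range(k - 1)] + [M + 1]
```

No data is examined at all, which is exactly why this only makes sense when the distribution is close to uniform.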

My point is this: the above discussion is one simplification that I think is very practical for one large class of sets. At the other end, it is straightforward to compute the error for every possible choice of representatives and then select the smallest; the running time makes that solution intractable in many cases. That the person asking this question expects more than the most direct and exact (intractable) answer leaves much that is open-ended. We can trim at the edges from here to eternity, trying to quantify all sorts of properties along the infinite solution space of all possible permutations (or combinations) of n numbers and all k values.