With pairwise summation, how many terms do I need to get an appreciably wrong result? With pairwise summation, how many terms do I need to get an appreciably wrong result? numpy numpy

With pairwise summation, how many terms do I need to get an appreciably wrong result?


Depth 1432 (so 2^1432 terms) suffices for the true sum to exceed the computed sum by a factor of two.

I had an idea for how to determine the number of terms needed to less than a factor of two.

We use dynamic programming to answer the following question: given a depth d and a target floating point sum s, what is the largest true sum of 2^d nonnegative float16s with pairwise sum s?

Let that quantity be T(d, s). We get a recurrence

T(0, s) = s,    for all s.T(d, s) =            max            (T(d-1, a) + T(d-1, b)),    for all d, s.          a, b : float16(a + b) = s

Each step of the recurrence will involve looping over about 2^29 combinations (since we can assume a ≤ b, and negative floats and special values are off limits), and the depth required won't exceed 10^4 or so by Hans's and your answer. Seems feasible to me.

DP code:

#include <algorithm>#include <cstdio>#include <vector>using Float16 = int;using Fixed = unsigned long long;static constexpr int kExponentBits = 5;static constexpr int kFractionBits = 10;static constexpr Float16 kInfinity = ((1 << kExponentBits) - 1)                                     << kFractionBits;Fixed FixedFromFloat16(Float16 a) {  int exponent = a >> kFractionBits;  if (exponent == 0) {    return a;  }  Float16 fraction = a - (exponent << kFractionBits);  Float16 significand = (1 << kFractionBits) + fraction;  return static_cast<Fixed>(significand) << (exponent - 1);}bool Plus(Float16 a, Float16 b, Float16* c) {  Fixed exact_sum = FixedFromFloat16(a) + FixedFromFloat16(b);  int exponent = 64 - kFractionBits - __builtin_clzll(exact_sum);  if (exponent <= 0) {    *c = static_cast<Float16>(exact_sum);    return true;  }  Fixed ulp = Fixed{1} << (exponent - 1);  Fixed remainder = exact_sum & (ulp - 1);  Fixed rounded_sum = exact_sum - remainder;  if (2 * remainder > ulp ||      (2 * remainder == ulp && (rounded_sum & ulp) != 0)) {    rounded_sum += ulp;  }  exponent = 64 - kFractionBits - __builtin_clzll(rounded_sum);  if (exponent >= (1 << kExponentBits) - 1) {    return false;  }  Float16 significand = rounded_sum >> (exponent - 1);  Float16 fraction = significand - (Float16{1} << kFractionBits);  *c = (exponent << kFractionBits) + fraction;  return true;}int main() {  std::vector<Fixed> greatest0(kInfinity);  for (Float16 a = 0; a < kInfinity; a++) {    greatest0[a] = FixedFromFloat16(a);  }  for (int depth = 1; true; depth++) {    auto greatest1 = greatest0;    for (Float16 a = 1; a < kInfinity; a++) {      Fixed greatest0_a = greatest0[a];      for (Float16 b = a; b < kInfinity; b++) {        Float16 c;        if (!Plus(a, b, &c)) {          continue;        }        Fixed& value = greatest1[c];        value = std::max(value, greatest0_a + greatest0[b]);      }    }    std::vector<double> ratios;    ratios.reserve(kInfinity - 1);    for (Float16 a = 1; a < kInfinity; a++) {      ratios.push_back(greatest1[a] / static_cast<double>(FixedFromFloat16(a)));    }    std::printf("depth %d, ratio = %.17g\n", depth,                *std::max_element(ratios.begin(), ratios.end()));    greatest0.swap(greatest1);  }}

I'll run this and post an update when it's done.


It would take such a large number of terms that it's effectively impossible (if zeros are allowed) or actually impossible (if zeros are not allowed, due to overflow). Wikipedia summarizes some error bounds due to Nicolas Higham. Since all of the terms are nonnegative, the condition number is 1, hence the relative error for n terms is bounded as |En|/|Sn| ≤ ε log2 n / (1 - ε log2 n), where ε is the machine epsilon. To be off by a factor of two, we would need |En| ≥ |Sn|, which is possible only if ε log2 n ≥ 1/2, which is equivalent to n ≥ 21/(2 ε) = 21024 for float16.


The remaining issue is whether the sum is so sharp that you can get a relative error of 2 in pair-wise summation if you allow zero in the sum (*).

The simple answer is yes, by padding a bad sequence for cum-sum with an exponential number of zeros, as follows (where a1, a2, a3, ... an is problematic for the normal sum):

a1,a2,a3, 0,a4, 0, 0, 0,a5, 0, 0, 0, 0, 0, 0, 0,a6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

It will generate the same sum, with the same round-off error for pair-wise summation, and you "only" need 2**(n-1) terms instead of n. Thus since 10**4 terms can generate a factor of 4 for normal summation, then 2**(10**4-1) terms can give a factor of 4 for pair-wise summation.

*: The answer by David Eistenstat shows that disallowing zero the sum will overflow before getting that problematic.(I assume that the pair-wise summation recurses to the end.)