
Any ideas why R and Python's NumPy scaling of vectors is not matching?


For the std, which is clearly off by a substantial amount: in numpy, std() returns sqrt(sum((x - x.mean())**2) / (n - ddof)), where ddof=0 by default. I guess R uses the equivalent of ddof=1, because:

    In [7]: s.std()
    Out[7]: 12.137473069268983

    In [8]: s.std(ddof=1)
    Out[8]: 12.255890244843339
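To make the ddof convention concrete, here is a minimal check of that formula against NumPy (the vector below is made up, since the original s isn't reproduced here):

    import numpy as np

    # made-up stand-in for the original s
    s = np.array([1.2e-7, -3.4e-7, 2.1e-7, -5e-8])
    n = len(s)

    pop = np.sqrt(np.sum((s - s.mean())**2) / n)         # ddof = 0 (numpy default)
    samp = np.sqrt(np.sum((s - s.mean())**2) / (n - 1))  # ddof = 1 (what R's sd() uses)

    print(np.isclose(pop, s.std()))          # True
    print(np.isclose(samp, s.std(ddof=1)))   # True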

and:

    > sd(s)
    [1] 12.25589

I can't explain the mean, but since it's basically zero in each case, I would chalk it up to floating-point precision issues. numpy would report the two values as 'close enough' under its default tolerances:

    In [5]: np.isclose(s.mean(), 1.24345e-14)
    Out[5]: True

The other answers discuss this issue better than I can.


This sheds light on some of it, using plain Python, with the s list as given in the original post:

    >>> import math
    >>> sum(s) / len(s)
    1.3664283380001927e-14
    >>> math.fsum(s) / len(s)
    1.2434497875801753e-14

The first output reproduces np.mean(), and the second reproduces the R mean() (I'm sure that if the R code had used options(digits=17) they'd be identical).

The difference in Python is that sum() adds "left to right" suffering a rounding error after each addition, while math.fsum() conceptually computes an infinite-precision sum, with a grand total of one rounding at the end to replace the infinite-precision sum with the closest representable double precision number.

Dollars to doughnuts says that's what R does too. That would explain why @John reports that R returns the same mean regardless of the order of the numbers in s (an infinite-precision sum is wholly insensitive to the order of the summands).
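As a quick illustration of that order-insensitivity (toy numbers, not the original s):

    >>> import math
    >>> a = [1e16, 1.0, -1e16, 1e-3]
    >>> b = [1e16, -1e16, 1.0, 1e-3]   # same numbers, different order
    >>> sum(a), sum(b)                 # left-to-right summation: the order matters
    (0.001, 1.001)
    >>> math.fsum(a), math.fsum(b)     # correctly rounded either way
    (1.001, 1.001)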

I don't think that's the end of it, though. R is probably using a better numerical method for computing std dev too - "better" in the sense of smaller numeric error, but probably "worse" in the sense of taking more time to compute.
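For concreteness, here is one example of a numerically friendlier method (Welford's one-pass algorithm) in plain Python. It's only an illustration of the kind of method meant, not a claim about what R actually implements:

    def welford_pstdev(xs):
        # One numerically stable pass: maintain a running mean and a running
        # sum of squared deviations (m2), avoiding huge intermediate sums.
        mean = 0.0
        m2 = 0.0
        for k, x in enumerate(xs, start=1):
            delta = x - mean
            mean += delta / k
            m2 += delta * (x - mean)
        return (m2 / len(xs)) ** 0.5   # population sdev; use len(xs) - 1 for the sample version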

Note that PEP 450 - "Adding A Statistics Module To The Standard Library" was recently accepted for Python. That will add some high-quality (numerically) implementations of these things to the standard library. Of course it's up to numpy to decide whether they want to use these too.
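For example, once that module landed as Python 3.4's statistics module, the two conventions are spelled out by name; a small sketch with toy data:

    import statistics

    xs = [3.1, -2.7, 0.4, 18.0, -9.9]   # toy data, not the original s

    statistics.mean(xs)     # high-quality mean (avoids naive float accumulation)
    statistics.pstdev(xs)   # population sdev -- what np.std() gives by default
    statistics.stdev(xs)    # sample sdev -- what R's sd() gives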

About std dev

Because the mean is close to 0 no matter how it's computed, and the numbers in s aren't close to 0 at all, the difference in the computed means is pretty much irrelevant. To prove this, here's a building block that does infinite-precision calculation (again plain Python):

    from fractions import Fraction

    def sumsq(xs):
        fs = [Fraction(x) for x in xs]
        mean = sum(fs) / len(fs)
        return sum((f - mean)**2 for f in fs)

Now we can use that to produce very high quality (and very slow!) estimations of population and sample standard deviation:

    >>> ss = sumsq(s)
    >>> ss  # exact result: no rounding errors so far!
    Fraction(606931231449932225838747590566767, 79228162514264337593543950336)
    >>> from math import sqrt
    >>> sqrt(ss / len(s))        # population sdev with 2 roundings
    12.137473069268983
    >>> sqrt(ss / (len(s) - 1))  # sample sdev with 2 roundings
    12.255890244843338
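For reuse, the same building block can be wrapped in two small helpers (the names are mine, not from the snippet above):

    from math import sqrt

    def exact_pstdev(xs):
        # population sdev: the only roundings are the float conversion and the sqrt
        return sqrt(sumsq(xs) / len(xs))

    def exact_stdev(xs):
        # sample sdev: divide by n - 1 instead of n
        return sqrt(sumsq(xs) / (len(xs) - 1))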

So - surprise, surprise ;-) - np.std(s) computed the best possible double approximation to the population standard deviation, and R's sd() computed the best possible double approximation to the sample standard deviation.

So, in this specific case, the numeric difference between the computed means was a red herring - and because the mean was tiny compared to the original numbers, just about any way of computing standard deviation gives good numeric results.

The real difference here is just that np uses n in the denominator (population sdev) by default, while R uses n-1 in the denominator (sample sdev) by default.
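So if the goal is simply to make the two agree, it's enough to pass ddof=1 on the numpy side (toy data again):

    import numpy as np

    s = np.array([3.1, -2.7, 0.4, 18.0, -9.9])   # toy stand-in for s

    np.std(s)           # population sdev: divides by n
    np.std(s, ddof=1)   # sample sdev: divides by n - 1, matching R's sd()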


Remember that the relative precision of a 64-bit double is only about 2e-16. If you sum these numbers, you find that the sum, just like the mean, is very close to 0, so the problem likely comes down to that precision. Each of the functions you reference needs to sum the numbers first, so I went back to the start.

In R, Reduce('+', s) yields the same sum as Python's sum(): within R and Python the sequential sums are exactly the same. However, R's mean and sum functions use more accurate methods to do their math. When you do all of the math within R in the same way it's being done in numpy, the results are identical.
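On the Python side, the sequential sum described here is what both sum() and a reduce over + produce; math.fsum() is the correctly rounded alternative (toy data below, not the original s):

    import math
    import operator
    from functools import reduce

    xs = [1e16, 1.0, -1e16, 1e-3]   # toy data with heavy cancellation

    reduce(operator.add, xs) == sum(xs)   # True: both are left-to-right sums
    math.fsum(xs)                         # 1.001 -- the correctly rounded sum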

There are reasons to be concerned about the python calculations you are using. The R code you're using is actually handling things better. Try:

    # R
    sum(s)
    sum(s * 10000) / 10000
    Reduce('+', s)
    Reduce('+', s * 10000) / 10000

    # python (numpy is the same here)
    sum(s)
    sum(s * 10000) / 10000

The sum in R handles the scaling properly, as both sums are the same; however, neither R nor Python can deal with it using a sequential sum method. Another thing you can try is scrambling the numbers (a minimal sketch is below): sum in R consistently gives the same value, while both Reduce in R and sum in Python give different values depending on the order.
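A sketch of that shuffling experiment on the Python side, assuming s is a plain list; math.fsum is shown only as an order-independent reference, with toy data standing in for s:

    import math
    import random

    xs = [1e16, 1.0, -1e16, 1e-3] * 100   # toy data, not the original s

    seq, exact = set(), set()
    for _ in range(10):
        random.shuffle(xs)
        seq.add(sum(xs))          # sequential sum: varies with the order
        exact.add(math.fsum(xs))  # correctly rounded sum: identical for every order

    len(seq), len(exact)   # typically several distinct values vs. exactly 1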

So, what do you do? I suggest you're going to have to accept that your precision is only so high and treat values close to 0 as 0. This gives you problems, as you've seen, with functions like mean and standard deviation that sum these numbers internally: the error in the sum carries into the mean and then blows up when you go on to compute the variance. Perhaps more information on exactly why such numbers must be identical would help you get more precise advice.
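If that route is acceptable, a minimal sketch of "treat values close to 0 as 0" on the numpy side (both the data and the tolerance are made-up values you'd choose for your own data's scale):

    import numpy as np

    s = np.array([1.2e-14, -3.4e-16, 2.1e-15])   # toy stand-in for s
    tol = 1e-12                                   # assumed tolerance, not from the question

    s_clean = np.where(np.abs(s) < tol, 0.0, s)   # zero out anything below the tolerance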

There is an alternative that should work if identical results are all that matter: don't use R's built-in functions. They're too high quality and highlight problems in numpy's stats. If you roll your own mean and sd in R, summing with Reduce as shown above, the results will be the same as Python's. However, doing so makes R slower and less precise, so if you can avoid this option at all, do so. For example:

    npMean <- function(x) Reduce('+', x) / length(x)
    npMean(s)

    npSD <- function(x) {
      m <- npMean(x)
      sqrt(Reduce('+', (x - m)^2) / length(x))
    }
    npSD(s)

will give precisely the Python mean and the numpy SD (the population version the other answers discuss). Those will work, but sometimes it's going to be hard to get around the guts of R making things more precise than you want. Of course, if you can find Python functions to replace the numpy ones and make your Python code more accurate, that would be even better.
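A sketch of that opposite direction on the Python side (the function names are mine): replace naive summation with math.fsum, or use the statistics module mentioned above, to make the Python results more accurate rather than the R results less so:

    import math

    def accurate_mean(xs):
        # correctly rounded sum, so the result doesn't depend on the order of xs
        return math.fsum(xs) / len(xs)

    def accurate_pstdev(xs):
        # population sdev built on the same correctly rounded sums
        m = accurate_mean(xs)
        return math.sqrt(math.fsum((x - m) ** 2 for x in xs) / len(xs))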