mean from pandas and numpy differ
Short version:
The reason it's different is because pandas
uses bottleneck
(if it's installed) when calling the mean
operation, as opposed to just relying on numpy
. bottleneck
is presumably used since it appears to be faster than numpy
(at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).
Long version:
It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean
, turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic, it'll tell you exactly what's going on.
Here's some of my stack trace (values differ slightly since no seed for RNG):
Can reproduce (Windows):
>>> import numpy as np; import pandas as pd>>> x=np.random.normal(-9.,.005,size=900000)>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])>>> df['x'].mean()-9.0>>> x.mean()-9.0000037501099754>>> x.astype(np.float32).mean()-9.0000029
Nothing extraordinary going on with numpy
's version. It's the pandas
version that's a little wacky.
Let's have a look inside df['x'].mean()
:
>>> def test_it_2():... import pdb; pdb.set_trace()... df['x'].mean()>>> test_it_2()... # Some stepping/poking around that isn't important(Pdb) l23072308 if we have an ndarray as a value, then simply perform the operation,2309 otherwise delegate to the object23102311 """2312 -> delegate = self._values2313 if isinstance(delegate, np.ndarray):2314 # Validate that 'axis' is consistent with Series's single axis.2315 self._get_axis_number(axis)2316 if numeric_only:2317 raise NotImplementedError('Series.{0} does not implement '(Pdb) delegate.dtypedtype('float32')(Pdb) l2315 self._get_axis_number(axis)2316 if numeric_only:2317 raise NotImplementedError('Series.{0} does not implement '2318 'numeric_only.'.format(name))2319 with np.errstate(all='ignore'):2320 -> return op(delegate, skipna=skipna, **kwds)23212322 return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,2323 numeric_only=numeric_only,2324 filter_type=filter_type, **kwds)
So we found the trouble spot, but now things get kind of weird:
(Pdb) op<function nanmean at 0x000002CD8ACD4488>(Pdb) op(delegate)-9.0(Pdb) delegate_64 = delegate.astype(np.float64)(Pdb) op(delegate_64)-9.000003749978807(Pdb) delegate.mean()-9.0000029(Pdb) delegate_64.mean()-9.0000037499788075(Pdb) np.nanmean(delegate, dtype=np.float64)-9.0000037499788075(Pdb) np.nanmean(delegate, dtype=np.float32)-9.0000029
Note that delegate.mean()
and np.nanmean
output -9.0000029
with type float32
, not -9.0
as pandas
nanmean
does. With a bit of poking around, you can find the source to pandas
nanmean
in pandas.core.nanops
. Interestingly, it actually appears like it should be matching numpy
at first. Let's have a look at pandas
nanmean
:
(Pdb) import inspect(Pdb) src = inspect.getsource(op).split("\n")(Pdb) for line in src: print(line)@disallow('M8')@bottleneck_switch()def nanmean(values, axis=None, skipna=True): values, mask, dtype, dtype_max = _get_values(values, skipna, 0) dtype_sum = dtype_max dtype_count = np.float64 if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype): dtype_sum = np.float64 elif is_float_dtype(dtype): dtype_sum = dtype dtype_count = dtype count = _get_counts(mask, axis, dtype=dtype_count) the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum)) if axis is not None and getattr(the_sum, 'ndim', False): the_mean = the_sum / count ct_mask = count == 0 if ct_mask.any(): the_mean[ct_mask] = np.nan else: the_mean = the_sum / count if count > 0 else np.nan return _wrap_results(the_mean, dtype)
Here's a (short) version of the bottleneck_switch
decorator:
import bottleneck as bn...class bottleneck_switch(object): def __init__(self, **kwargs): self.kwargs = kwargs def __call__(self, alt): bn_name = alt.__name__ try: bn_func = getattr(bn, bn_name) except (AttributeError, NameError): # pragma: no cover bn_func = None ... if (_USE_BOTTLENECK and skipna and _bn_ok_dtype(values.dtype, bn_name)): result = bn_func(values, axis=axis, **kwds)
This is called with alt
as the pandas
nanmean
function, so bn_name
is 'nanmean'
, and this is the attr that's grabbed from the bottleneck
module:
(Pdb) l 93 result = np.empty(result_shape) 94 result.fill(0) 95 return result 96 97 if (_USE_BOTTLENECK and skipna and 98 -> _bn_ok_dtype(values.dtype, bn_name)): 99 result = bn_func(values, axis=axis, **kwds)100101 # prefer to treat inf/-inf as NA, but must compute the fun102 # twice :(103 if _has_infs(result):(Pdb) n> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()-> result = bn_func(values, axis=axis, **kwds)(Pdb) alt<function nanmean at 0x000001D2C8C04378>(Pdb) alt.__name__'nanmean'(Pdb) bn_func<built-in function nanmean>(Pdb) bn_name'nanmean'(Pdb) bn_func(values, axis=axis, **kwds)-9.0
Pretend that bottleneck_switch()
decorator doesn't exist for a second. We can actually see that calling that manually stepping through this function (without bottleneck
) will get you the same result as numpy
:
(Pdb) from pandas.core.nanops import _get_counts(Pdb) from pandas.core.nanops import _get_values(Pdb) from pandas.core.nanops import _ensure_numeric(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)(Pdb) count900000.0(Pdb) values.sum(axis=None, dtype=dtype) / count-9.0000029
That never gets called, though, if you have bottleneck
installed. Instead, the bottleneck_switch()
decorator instead blasts over the nanmean
function with bottleneck
's version. This is where the discrepancy lies (interestingly it matches on the float64
case, though):
(Pdb) import bottleneck as bn(Pdb) bn.nanmean(delegate)-9.0(Pdb) bn.nanmean(delegate.astype(np.float64))-9.000003749978807
bottleneck
is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean
function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy
by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.
Is bottleneck actually faster?
Sure looks like it (at least on my machine).
In [1]: import numpy as np; import pandas as pdIn [2]: x=np.random.normal(-9.8,.05,size=900000)In [3]: y_32 = x.astype(np.float32)In [13]: %timeit np.nanmean(y_32)100 loops, best of 3: 5.72 ms per loopIn [14]: %timeit bn.nanmean(y_32)1000 loops, best of 3: 854 µs per loop
It might be nice for pandas
to introduce a flag here (one for speed, the other for better precision, default is for speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.
HTH.
@Matt Messersmith answer is a great investigation, but I would like to add an in my opinion important point: both results (numpy's and pandas') are wrong. However, numpy has higher probability to be less wrong than panda.
There is no fundamental difference between using float32
and float64
, however, for float32
, problems can be observed for smaller data sets than for float64
.
It is not really defined, how the mean
should be calculated - the given mathematical definition is only unambiguous for infinitely precise numbers, but not for the floating point operations our PCs are using.
So what is the "right" formula?
mean = (x0+..xn)/n or mean = [(x0+x1)+(x2+x3)+..]/n or mean = 1.0/n*(x0+..xn) and so on...
Obviously, when calculated on modern hardware they all will give different results - one would ideally peek a formula which makes the smallest error compared to a theoretical right value (which is calculated with infinite precision).
Numpy uses slightly alternated pairwise summation, i.e. (((x1+x2)+(x3+x4))+(...))
, which, even if not perfect, is known to be quite good. On the other hand, bottleneck uses the naive summation x1+x2+x3+...
:
REDUCE_ALL(nanmean, DTYPE0){ ... WHILE { FOR { ai = AI(DTYPE0); if (ai == ai) { asum += ai; <---- HERE WE GO count += 1; } } NEXT } ...}
and we can easily see what goes on: After some steps, bottleneck
sums one large (sum of all previous elements, proportional to -9.8*number_of_steps
) and one small element (about -9.8
) which leads to quite an rounding error of about big_number*eps
, with eps being around 1e-7
for float32
. This means that after 10^6 summations we could have a relative error of about 10% (eps*10^6
, this is an upper bound).
For float64
and eps
being about 1e-16
the relative error would be only about 1e-10
after 10^6 summations. It might seem to be precise to us, but measured against the possible precision it is still a fiasco!
Numpy on the other hand (at least for the series at hand) will add two elements which are almost equal. In this case the upper bound for the resulting relative error is eps*log_2(n)
, which is
- maximal
2e-6
forfloat32
and 10^6 elements - maximal
2e-15
forfloat64
and 10^6 elements.
From the above, among others, there are following noteworthy implications:
- if the mean of the distribution is
0
, then pandas and numpy are almost equally precise - the magnitude of summed numbers is about0.0
and there is no big difference between summands which would lead to a large rounding error for naive summation. - if one knows a good estimate for the mean, it might be more robust to calculate the sum of
x'i=xi-mean_estimate
, becausex'i
will have mean of0.0
. - something like
x=(.333*np.ones(1000000)).astype(np.float32)
is enough to trigger the strange behavior of pandas' version - no need for randomness, and we know what the result should be, don't we? It is important, that0.333
cannot be presented precisely with floating point.
NB: The above holds for 1-dimensional numpy-arrays. Situation is more complicated for summing along an axis for multi-dimensional numpy-arrays, as numpy sometimes switches to naive-summation. For a more detailed investigation see this SO-post, which also explains @Mark Dickinson observation, i.e.:
np.ones((2, 10**8), dtype=np.float32).mean(axis=1)
are accurate butnp.ones((10**8, 2), dtype=np.float32).mean(axis=0)
aren't