numpy array: replace nan values with average of columns

No loops required:

    print(a)
    [[ 0.93230948         nan  0.47773439  0.76998063]
     [ 0.94460779  0.87882456  0.79615838  0.56282885]
     [ 0.94272934  0.48615268  0.06196785         nan]
     [ 0.64940216  0.74414127         nan         nan]]

    # Obtain mean of columns as you need, nanmean is convenient.
    col_mean = np.nanmean(a, axis=0)
    print(col_mean)
    [ 0.86726219  0.7030395   0.44528687  0.66640474]

    # Find indices that you need to replace
    inds = np.where(np.isnan(a))

    # Place column means in the indices. Align the arrays using take
    a[inds] = np.take(col_mean, inds[1])
    print(a)
    [[ 0.93230948  0.7030395   0.47773439  0.76998063]
     [ 0.94460779  0.87882456  0.79615838  0.56282885]
     [ 0.94272934  0.48615268  0.06196785  0.66640474]
     [ 0.64940216  0.74414127  0.44528687  0.66640474]]
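For readers who want to run the approach end to end, here is a minimal self-contained sketch using the same data. It indexes with `col_mean[cols]` instead of `np.take`, which should be equivalent here; the names `rows` and `cols` are only illustrative.

```python
import numpy as np

# Example data from the listing above, with nan marking missing entries.
a = np.array([
    [0.93230948, np.nan,     0.47773439, 0.76998063],
    [0.94460779, 0.87882456, 0.79615838, 0.56282885],
    [0.94272934, 0.48615268, 0.06196785, np.nan],
    [0.64940216, 0.74414127, np.nan,     np.nan],
])

# Column means computed while ignoring nan entries.
col_mean = np.nanmean(a, axis=0)

# Row and column indices of every nan entry.
rows, cols = np.where(np.isnan(a))

# Each nan receives the mean of its own column (this modifies a in place).
a[rows, cols] = col_mean[cols]

print(a)
```

Note that the assignment overwrites `a`; work on `a.copy()` if the original array is still needed.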

python arrays numpy nan

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

    >>> a
    array([[  0.,  nan,  10.,  nan],
           [  1.,   6.,  nan,  nan],
           [  2.,   7.,  12.,  nan],
           [  3.,   8.,  nan,  nan],
           [ nan,   9.,  14.,  nan]])
    >>> import numpy.ma as ma
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)
    array([[  0. ,   7.5,  10. ,   0. ],
           [  1. ,   6. ,  12. ,   0. ],
           [  2. ,   7. ,  12. ,   0. ],
           [  3. ,   8. ,  12. ,   0. ],
           [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.

Also note how the all-nan column is handled. With no valid elements to average, the mean of that column is masked, and np.where ends up inserting 0 for it, as the output above shows. The method using nanmean leaves such a column as nan and emits a warning:

    >>> col_mean = np.nanmean(a, axis=0)
    /home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
      warnings.warn("Mean of empty slice", RuntimeWarning)
    >>> inds = np.where(np.isnan(a))
    >>> a[inds] = np.take(col_mean, inds[1])
    >>> a
    array([[  0. ,   7.5,  10. ,   nan],
           [  1. ,   6. ,  12. ,   nan],
           [  2. ,   7. ,  12. ,   nan],
           [  3. ,   8. ,  12. ,   nan],
           [  1.5,   9. ,  14. ,   nan]])
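The value that ends up in an all-nan column is implicit in the masked-array one-liner. If you would rather choose it yourself, a minimal sketch could call `filled()` on the masked mean before the replacement. The constant `EMPTY_COLUMN_FILL` is a hypothetical name introduced for this example, and 0.0 is just one possible choice.

```python
import numpy as np
import numpy.ma as ma

a = np.array([
    [0.0,    np.nan, 10.0,   np.nan],
    [1.0,    6.0,    np.nan, np.nan],
    [2.0,    7.0,    12.0,   np.nan],
    [3.0,    8.0,    np.nan, np.nan],
    [np.nan, 9.0,    14.0,   np.nan],
])

# Hypothetical choice: value to use for columns that contain no valid data.
EMPTY_COLUMN_FILL = 0.0

# Mask the nan entries, then average each column over the unmasked values.
masked = ma.array(a, mask=np.isnan(a))
col_mean = masked.mean(axis=0)

# Columns with no valid entries come back masked; give them an explicit value.
col_mean = col_mean.filled(EMPTY_COLUMN_FILL)

# Use the column mean wherever a is nan, and the original value elsewhere.
filled = np.where(np.isnan(a), col_mean, a)
print(filled)
```

Setting `EMPTY_COLUMN_FILL = np.nan` instead would keep all-nan columns as nan, matching what the nanmean method produces.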

Explanation

Converting a into a masked array gives you

    >>> ma.array(a, mask=np.isnan(a))
    masked_array(data =
     [[0.0 -- 10.0 --]
     [1.0 6.0 -- --]
     [2.0 7.0 12.0 --]
     [3.0 8.0 -- --]
     [-- 9.0 14.0 --]],
                 mask =
     [[False  True False  True]
     [False False  True  True]
     [False False False  True]
     [False False  True  True]
     [ True False False  True]],
           fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

    >>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
    masked_array(data = [1.5 7.5 12.0 --],
                 mask = [False False False  True],
           fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.

Row-wise mean

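Replacing nan values with the row-wise mean instead of the column-wise mean requires one small change so that broadcasting works. The row means have shape (4,), which does not broadcast against the (4, 5) array, so they need a trailing axis added with np.newaxis: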

    >>> a
    array([[  0.,   1.,   2.,   3.,  nan],
           [ nan,   6.,   7.,   8.,   9.],
           [ 10.,  nan,  12.,  nan,  14.],
           [ nan,  nan,  nan,  nan,  nan]])
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
    ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)
    >>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
    array([[  0. ,   1. ,   2. ,   3. ,   1.5],
           [  7.5,   6. ,   7. ,   8. ,   9. ],
           [ 10. ,  12. ,  12. ,  12. ,  14. ],
           [  0. ,   0. ,   0. ,   0. ,   0. ]])
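The column-wise and row-wise cases differ only in which axis is reduced and where the reduced axis is put back for broadcasting. A rough sketch that wraps both in one helper might look like this. The function name `fill_nan_with_mean` and its `empty_fill` parameter are hypothetical and not part of NumPy, and the sketch assumes a 2-D input.

```python
import numpy as np
import numpy.ma as ma


def fill_nan_with_mean(a, axis=0, empty_fill=0.0):
    """Return a copy of 2-D array `a` with nan replaced by the mean along `axis`.

    Hypothetical helper: `empty_fill` is the value used for rows or columns
    that contain no valid (non-nan) entries.
    """
    a = np.asarray(a, dtype=float)
    nan_mask = np.isnan(a)

    # Mean over the unmasked values; all-nan slices come back masked.
    means = ma.array(a, mask=nan_mask).mean(axis=axis).filled(empty_fill)

    # Put the reduced axis back so the means broadcast against `a`.
    means = np.expand_dims(means, axis=axis)
    return np.where(nan_mask, means, a)


a = np.array([
    [0.0,    1.0,    2.0,    3.0,    np.nan],
    [np.nan, 6.0,    7.0,    8.0,    9.0],
    [10.0,   np.nan, 12.0,   np.nan, 14.0],
    [np.nan, np.nan, np.nan, np.nan, np.nan],
])

row_filled = fill_nan_with_mean(a, axis=1)
print(row_filled)
```

`np.expand_dims(means, axis=axis)` plays the same role as `[:, np.newaxis]` in the row-wise case, and it also covers `axis=0` without a separate code path.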


If partial is your original data, and replace is an array of the same shape containing the averaged values, then this code takes the value from partial wherever one exists and the value from replace wherever partial is nan.

    complete = np.where(np.isnan(partial), replace, partial)
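The answer does not say how the array of averaged values is built. One possible way, shown here as a sketch with made-up data, is to compute the column means with `np.nanmean` and expand them to the shape of `partial` with `np.broadcast_to`.

```python
import numpy as np

# Made-up example data; nan marks the missing entries.
partial = np.array([
    [1.0,    np.nan],
    [3.0,    4.0],
    [np.nan, 8.0],
])

# One way to build `replace`: repeat the column means to the shape of `partial`.
col_mean = np.nanmean(partial, axis=0)
replace = np.broadcast_to(col_mean, partial.shape)

# Keep the value from `partial` where it exists, else take it from `replace`.
complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```

Because `np.where` broadcasts its arguments, passing `col_mean` directly would give the same result; the full-shape `replace` array is built only to match the description above.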
