numpy corrcoef - compute correlation matrix while ignoring missing data
One of the main features of pandas is that it is NaN-friendly. To calculate the correlation matrix, simply call df_counties.corr(). Below is an example demonstrating that df.corr() is NaN-tolerant whereas np.corrcoef is not.
    import pandas as pd
    import numpy as np

    # data
    # ==============================
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
    df[df < 0] = np.nan
    df

             A       B       C       D       E
    0   1.7641  0.4002  0.9787  2.2409  1.8676
    1      NaN  0.9501     NaN     NaN  0.4106
    2   0.1440  1.4543  0.7610  0.1217  0.4439
    3   0.3337  1.4941     NaN  0.3131     NaN
    4      NaN  0.6536  0.8644     NaN  2.2698
    5      NaN  0.0458     NaN  1.5328  1.4694
    6   0.1549  0.3782     NaN     NaN     NaN
    7   0.1563  1.2303  1.2024     NaN     NaN
    8      NaN     NaN     NaN  1.9508     NaN
    9      NaN     NaN  0.7775     NaN     NaN
    ..     ...     ...     ...     ...     ...
    90     NaN  0.8202  0.4631  0.2791  0.3389
    91  2.0210     NaN     NaN  0.1993     NaN
    92     NaN     NaN     NaN  0.1813     NaN
    93  2.4125     NaN     NaN     NaN  0.2515
    94     NaN     NaN     NaN     NaN  1.7389
    95  0.9944  1.3191     NaN  1.1286  0.4960
    96  0.7714  1.0294     NaN     NaN  0.8626
    97     NaN  1.5133  0.5531     NaN  0.2205
    98     NaN     NaN  1.1003  1.2980  2.6962
    99     NaN     NaN     NaN     NaN     NaN

    [100 rows x 5 columns]

    # calculations
    # ================================
    df.corr()

             A        B        C        D        E
    A   1.0000   0.2718   0.2678   0.2822   0.1016
    B   0.2718   1.0000  -0.0692   0.1736  -0.1432
    C   0.2678  -0.0692   1.0000  -0.3392   0.0012
    D   0.2822   0.1736  -0.3392   1.0000   0.1562
    E   0.1016  -0.1432   0.0012   0.1562   1.0000

    np.corrcoef(df, rowvar=False)

    array([[ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan],
           [ nan,  nan,  nan,  nan,  nan]])
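If you need to stay in plain NumPy, the NaN-tolerant behaviour can be approximated by computing each pair of columns on the rows where both values are present. As far as I understand, this pairwise deletion is what df.corr() does for Pearson correlation, but treat that as an assumption. The helper name pairwise_corr and the sample data below are made up for illustration, and this is only a rough sketch rather than a tuned implementation.

```python
import numpy as np


def pairwise_corr(data):
    """Pearson correlation between columns, using for each pair of
    columns only the rows where both values are non-NaN."""
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    valid = ~np.isnan(data)
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]
            if both.sum() < 2:
                # not enough overlapping observations: leave NaN
                continue
            r = np.corrcoef(data[both, i], data[both, j])[0, 1]
            out[i, j] = out[j, i] = r
    return out


# illustrative data: roughly half of the entries become NaN
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 5))
x[x < 0] = np.nan

result = pairwise_corr(x)
print(np.round(result, 4))
```

The double loop is fine for a handful of columns. Note that each entry of the matrix may be based on a different subset of rows.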
This will work, using NumPy's masked array module:
    import numpy as np
    import numpy.ma as ma

    # np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0
    A = [1, 2, 3, 4, 5, np.nan]
    B = [2, 3, 4, 5.25, np.nan, 100]

    print(ma.corrcoef(ma.masked_invalid(A), ma.masked_invalid(B)))
It outputs:
    [[1.0 0.99838143945703]
     [0.99838143945703 1.0]]
Read more here: https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
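To see what the masking step does before the correlation is computed, here is a small sketch with made-up values. ma.masked_invalid hides NaN and infinite entries, and masked-array statistics then skip them.

```python
import numpy as np
import numpy.ma as ma

# illustrative values: one NaN and one inf among valid numbers
values = [1.0, np.nan, 3.0, np.inf]
masked = ma.masked_invalid(values)

# boolean array, True where an entry is hidden
mask = ma.getmaskarray(masked)
print(mask)

# only the unmasked entries, as a plain ndarray
print(masked.compressed())

# the mean is taken over the unmasked entries only
print(masked.mean())
```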
In case you expect a different number of NaNs in each array, you may consider taking a logical AND of the non-NaN masks, so that only positions that are valid in both arrays are used.
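    import numpy as np
    import numpy.ma as ma

    # A and B are the lists from the example above
    a = ma.masked_invalid(A)
    b = ma.masked_invalid(B)

    # keep only positions that are valid in both arrays
    msk = (~a.mask & ~b.mask)
    print(ma.corrcoef(a[msk], b[msk]))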
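The same logical AND idea also works without masked arrays. This is a minimal sketch that builds the mask with np.isfinite and passes the filtered values to np.corrcoef. It uses the same A and B as above, and the variable names keep and r are just illustrative.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, np.nan])
B = np.array([2, 3, 4, 5.25, np.nan, 100])

# True only where both arrays hold a finite value
keep = np.isfinite(A) & np.isfinite(B)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(A[keep], B[keep])[0, 1]
print(r)
```

Only the first four positions survive the mask here, so the result agrees with the masked-array output shown earlier (about 0.998).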