
Best way to count the number of rows with missing values in a pandas DataFrame


For the second count, I think you can just subtract the number of rows returned by dropna from the total number of rows:

    In [14]:
    import pandas as pd
    from numpy.random import randn
    df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    df

    Out[14]:
            one       two     three
    a -0.209453 -0.881878  3.146375
    b       NaN       NaN       NaN
    c  0.049383 -0.698410 -0.482013
    d       NaN       NaN       NaN
    e -0.140198 -1.285411  0.547451
    f -0.219877  0.022055 -2.116037
    g       NaN       NaN       NaN
    h -0.224695 -0.025628 -0.703680

    In [18]:
    df.shape[0] - df.dropna().shape[0]

    Out[18]:
    3
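As a self-contained sketch of the same dropna idea, here is a small frame with made-up values (the data and the variable names are my own, not from the session above). It also shows the `how="all"` option, which only drops rows where every value is missing:

```python
import numpy as np
import pandas as pd

# Small deterministic frame; the values are invented for illustration.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, 4.0],
        "two": [np.nan, np.nan, 0.5, 2.0],
        "three": [7.0, np.nan, 9.0, np.nan],
    },
    index=["a", "b", "c", "d"],
)

# dropna() keeps only rows with no missing values, so the difference
# is the number of rows that contain at least one NaN.
rows_with_any_nan = len(df) - len(df.dropna())

# dropna(how="all") removes only rows that are entirely NaN.
rows_all_nan = len(df) - len(df.dropna(how="all"))

print(rows_with_any_nan, rows_all_nan)
```

Rows a, b and d each have at least one NaN, while only row b is entirely NaN, so the two counts differ.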

The first can be done with built-in methods; this counts every missing cell in the frame:

    In [30]:
    df.isnull().values.ravel().sum()

    Out[30]:
    9
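To make it explicit what that expression counts, here is a minimal sketch on a tiny frame with invented values. It compares the flattened count with a per-column breakdown; both are counts of cells, not rows:

```python
import numpy as np
import pandas as pd

# Invented data: three missing cells spread over two columns.
df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]})

mask = df.isnull()                       # boolean frame, True where a value is missing

# Flatten the boolean array and sum it: total number of missing cells.
total_missing = int(mask.values.ravel().sum())

# Summing the mask column-wise gives the missing count per column.
per_column = mask.sum()

print(total_missing)
print(per_column)
```

Summing `per_column` again gives the same total as the flattened version, which is the `df.isnull().sum().sum()` variant that appears in the timings.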

Timings

    In [34]:
    %timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
    %timeit df.shape[0] - df.dropna().shape[0]
    %timeit sum(map(any, df.apply(pd.isnull)))

    1000 loops, best of 3: 1.55 ms per loop
    1000 loops, best of 3: 1.11 ms per loop
    1000 loops, best of 3: 1.82 ms per loop

    In [33]:
    %timeit sum(df.isnull().values.ravel())
    %timeit df.isnull().values.ravel().sum()
    %timeit df.isnull().sum().sum()

    1000 loops, best of 3: 215 µs per loop
    1000 loops, best of 3: 210 µs per loop
    1000 loops, best of 3: 605 µs per loop

So my alternatives are a little faster for a df of this size.

Update

So for a df with 80,000 rows I get the following:

    In [39]:
    %timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
    %timeit df.shape[0] - df.dropna().shape[0]
    %timeit sum(map(any, df.apply(pd.isnull)))
    %timeit np.count_nonzero(df.isnull())

    1 loops, best of 3: 9.33 s per loop
    100 loops, best of 3: 6.61 ms per loop
    100 loops, best of 3: 3.84 ms per loop
    1000 loops, best of 3: 395 µs per loop

    In [40]:
    %timeit sum(df.isnull().values.ravel())
    %timeit df.isnull().values.ravel().sum()
    %timeit df.isnull().sum().sum()
    %timeit np.count_nonzero(df.isnull().values.ravel())

    1000 loops, best of 3: 675 µs per loop
    1000 loops, best of 3: 679 µs per loop
    100 loops, best of 3: 6.56 ms per loop
    1000 loops, best of 3: 368 µs per loop

Actually np.count_nonzero wins this hands down.
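If you want to rerun a comparison like this outside IPython, the following is a rough sketch using the standard `timeit` module. The frame size matches the 80,000 rows mentioned above, but the random data, the seed, the 10% NaN rate and the choice of candidates are my own assumptions. It first checks that every candidate returns the same row count, since a fast expression that counts cells rather than rows is not a fair comparison:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 80,000 rows, 3 columns, roughly 10% of cells set to NaN.
rng = np.random.default_rng(0)
values = rng.standard_normal((80_000, 3))
values[rng.random(values.shape) < 0.1] = np.nan
df = pd.DataFrame(values, columns=["one", "two", "three"])

# Each candidate counts rows that contain at least one NaN.
candidates = {
    "dropna": lambda: len(df) - len(df.dropna()),
    "any(axis=1).sum()": lambda: int(df.isnull().any(axis=1).sum()),
    "count_nonzero": lambda: int(np.count_nonzero(df.isnull().values.any(axis=1))),
}

# Sanity check: all candidates must agree before timing means anything.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1, results

for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{name:20s} {seconds * 1e3:.3f} ms per call")
```

Absolute numbers will vary with the machine and the pandas and NumPy versions, so treat the ranking as something to measure rather than assume.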


So many wrong answers here. OP asked for number of rows with null values, not columns.

Here is a better example:

    import pandas as pd
    from numpy.random import randn
    df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'asdf'])
    print(df)

Now there are obviously 4 rows with null values.

               one       two     three
    a    -0.571617  0.952227  0.030825
    b          NaN       NaN       NaN
    c     0.627611 -0.462141  1.047515
    d          NaN       NaN       NaN
    e     0.043763  1.351700  1.480442
    f     0.630803  0.931862  1.500602
    g          NaN       NaN       NaN
    h     0.729103 -1.198237 -0.207602
    asdf       NaN       NaN       NaN

You would get 3 (the number of columns with NaNs) as the answer if you used some of the answers here. Fuentes' answer works.

Here is how I got it:

    df.isnull().any(axis=1).sum()
    # 4

    timeit df.isnull().any(axis=1).sum()
    # 10000 loops, best of 3: 193 µs per loop

Fuentes' approach:

    sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
    # 4

    timeit sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
    # 1000 loops, best of 3: 677 µs per loop
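To see why rows, columns and cells give different numbers, here is a small sketch on invented data where all three counts differ. It also shows one way to end up with the column count by accident: iterating over a DataFrame yields its column labels, so `map(any, ...)` ends up testing the label strings, not the data.

```python
import numpy as np
import pandas as pd

# Invented data: 4 of 5 rows and 2 of 3 columns contain a NaN.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, np.nan, np.nan],
        "two": [2.0, np.nan, np.nan, np.nan, 5.0],
        "three": [3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
mask = df.isnull()

rows_with_nan = int(mask.any(axis=1).sum())   # any() across each row
cols_with_nan = int(mask.any(axis=0).sum())   # any() down each column
cells_with_nan = int(mask.values.sum())       # every missing cell

# Pitfall: iterating a DataFrame yields column labels ("one", "two", "three").
# any("one") is True for any non-empty string, so this is just the number of
# columns when the labels are non-empty strings, regardless of the data.
pitfall = sum(map(any, mask))

print(rows_with_nan, cols_with_nan, cells_with_nan, pitfall)
```

Here the row count is 4, the column count is 2, the cell count is 6, and the pitfall expression returns 3 simply because the frame has three string-labelled columns.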


What about numpy.count_nonzero?

    np.count_nonzero(df.isnull().values)
    np.count_nonzero(df.isnull())           # also works

count_nonzero is pretty quick. To check, I constructed a DataFrame from a (1000, 1000) array, randomly inserted 100 NaN values at different positions, and measured the times of the various answers in IPython:

    %timeit np.count_nonzero(df.isnull().values)
    1000 loops, best of 3: 1.89 ms per loop

    %timeit df.isnull().values.ravel().sum()
    100 loops, best of 3: 3.15 ms per loop

    %timeit df.isnull().sum().sum()
    100 loops, best of 3: 15.7 ms per loop

Not a huge time improvement over the OP's original, but possibly less confusing in the code; your decision. There isn't really any difference in execution time between the two count_nonzero methods (with and without .values).
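For reference, a setup like the one described (a 1000 x 1000 frame with 100 NaNs at distinct positions) could be built roughly as follows. The seed and the variable names are assumptions of mine. Note that `np.count_nonzero` on the full mask counts missing cells; to count rows with missing values, reduce the mask with `any(axis=1)` first.

```python
import numpy as np
import pandas as pd

# Assumed setup: 1000 x 1000 random values with 100 NaNs at distinct positions.
rng = np.random.default_rng(42)
values = rng.standard_normal((1000, 1000))
flat_positions = rng.choice(values.size, size=100, replace=False)
row_idx, col_idx = np.unravel_index(flat_positions, values.shape)
values[row_idx, col_idx] = np.nan
df = pd.DataFrame(values)

mask = df.isnull().values                 # plain boolean NumPy array

# Number of missing cells in the whole frame.
missing_cells = int(np.count_nonzero(mask))

# Number of rows that contain at least one missing cell.
rows_with_nan = int(np.count_nonzero(mask.any(axis=1)))

print(missing_cells, rows_with_nan)
```

The row count can be smaller than the cell count whenever two or more of the NaNs land in the same row.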