Drop a column if it exceeds a specific number of NA values
Although jezrael's answer works, it is not the approach I would recommend. Instead, create a boolean mask with ~df.isnull().sum().gt(2) and apply it with .loc[:, m] to select the columns.
Full example:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': list('abcdef'),
        'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
        'C': [np.nan, 8, np.nan, np.nan, 2, 3],
        'D': [1, 3, 5, 7, 1, 0],
        'E': [5, 3, 6, 9, 2, np.nan],
        'F': list('aaabbb')
    })

    m = ~df.isnull().sum().gt(2)
    df = df.loc[:, m]
    print(df)
Returns:
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
Explanation
Assume we print the columns and the mask before applying it.
    print(df.columns.tolist())
    print(m.tolist())
It would return this:
    ['A', 'B', 'C', 'D', 'E', 'F']
    [True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
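The same mask can be wrapped in a small helper so that the limit is not hard-coded. This is only a minimal sketch: the function name drop_columns_with_many_nas and its max_na parameter are hypothetical names chosen for illustration, not part of pandas. It uses .le(max_na), which should be equivalent to the negated .gt(max_na) mask above.

```python
import numpy as np
import pandas as pd


def drop_columns_with_many_nas(frame: pd.DataFrame, max_na: int) -> pd.DataFrame:
    """Return a new DataFrame without the columns holding more than `max_na` NAs.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Per-column NA counts, turned into a mask that is True for columns to keep.
    keep = frame.isnull().sum().le(max_na)
    return frame.loc[:, keep]


df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

result = drop_columns_with_many_nas(df, max_na=2)
print(result.columns.tolist())  # ['A', 'D', 'E', 'F']
```

Because the helper returns the filtered frame instead of reassigning df, the original DataFrame stays available for comparison.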
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract the allowed number of NA values from the length of the DataFrame:
    N = 2
    df = df.dropna(thresh=len(df)-N, axis=1)
    print (df)
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
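Since thresh counts non-NA values, allowing at most N missing values per column translates to thresh = len(df) - N. The sketch below checks that arithmetic on the example data by comparing the dropna result with an explicit NA-count mask. The names by_thresh and by_mask are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

N = 2  # maximum number of NA values tolerated per column

# dropna keeps a column when it has at least `thresh` non-NA values.
by_thresh = df.dropna(thresh=len(df) - N, axis=1)

# Explicit mask for comparison: keep columns whose NA count is at most N.
by_mask = df.loc[:, df.isnull().sum().le(N)]

print(by_thresh.columns.tolist())
print(by_mask.columns.tolist())
```

Both lists should contain the same column labels for this data.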
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation fails with a dynamic column name held in a variable (it tries to select a column literally named column):
    df = pd.DataFrame({'A':list('abcdef'),
                       'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                       'C':[np.nan,8,np.nan,np.nan,2,3],
                       'D':[1,3,5,7,1,0],
                       'E':[5,3,6,9,2,np.nan],
                       'F':list('aaabbb')})
    print (df)
       A    B    C  D    E  F
    0  a  NaN  NaN  1  5.0  a
    1  b  NaN  8.0  3  3.0  a
    2  c  NaN  NaN  5  6.0  a
    3  d  5.0  NaN  7  9.0  b
    4  e  5.0  2.0  1  2.0  b
    5  f  NaN  3.0  0  NaN  b

    def check(df):
        for column in df:
            if df[column].isnull().sum() > 2:
                df.drop(column, axis=1, inplace=True)
        return df

    print (df.pipe(check))
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
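Note that check drops columns with inplace=True, so it also modifies the DataFrame that was passed in. If that side effect is unwanted, one possible variant is sketched below. The max_na parameter and the names too_sparse and cleaned are assumptions added for illustration; DataFrame.pipe forwards extra keyword arguments to the function.

```python
import numpy as np
import pandas as pd


def check(frame: pd.DataFrame, max_na: int = 2) -> pd.DataFrame:
    # Collect the offending column labels first, then drop them in one call.
    too_sparse = [col for col in frame.columns
                  if frame[col].isnull().sum() > max_na]
    # Without inplace=True, drop returns a new DataFrame and leaves `frame` intact.
    return frame.drop(columns=too_sparse)


df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

cleaned = df.pipe(check, max_na=2)
print(cleaned.columns.tolist())  # ['A', 'D', 'E', 'F']
print(df.columns.tolist())       # ['A', 'B', 'C', 'D', 'E', 'F']
```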
Alternatively, you can use count, which counts the non-null values in each column:
    In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
    Out[23]:
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
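One detail to watch with the count expression: gt keeps only columns with more than len(df.index) - 2 non-null values, so a column with exactly 2 missing values is dropped, whereas dropna(thresh=len(df) - 2, axis=1) keeps it. On the example data the results coincide because no column has exactly 2 NAs. The toy frame below is hypothetical and only meant to show the boundary case; ge gives the inclusive behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'X' has exactly 2 NA values out of 6 rows.
toy = pd.DataFrame({
    'X': [1, np.nan, 3, np.nan, 5, 6],
    'Y': [1, 2, 3, 4, 5, 6],
})

N = 2

# gt: requires more than 4 non-null values, so 'X' (4 non-null) is dropped.
strict = toy.loc[:, toy.count().gt(len(toy.index) - N)]

# ge: requires at least 4 non-null values, so 'X' is kept.
inclusive = toy.loc[:, toy.count().ge(len(toy.index) - N)]

print(strict.columns.tolist())     # ['Y']
print(inclusive.columns.tolist())  # ['X', 'Y']
```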