drops a column if it exceeds a specific number of NA values

python python-3.x pandas dataframe data-analysis

Although jezrael's answer works, it is not the approach I would recommend. Instead, build a boolean mask with ~df.isnull().sum().gt(2) and apply it with .loc[:, m] to select the columns to keep.

Full example:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'A': list('abcdef'),
        'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
        'C': [np.nan, 8, np.nan, np.nan, 2, 3],
        'D': [1, 3, 5, 7, 1, 0],
        'E': [5, 3, 6, 9, 2, np.nan],
        'F': list('aaabbb')
    })

    m = ~df.isnull().sum().gt(2)
    df = df.loc[:, m]
    print(df)

Returns:

       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b

Explanation

Assume we print the columns and the mask before applying it.

    print(df.columns.tolist())
    print(m.tolist())

It would return this:

    ['A', 'B', 'C', 'D', 'E', 'F']
    [True, False, False, True, True, True]

Columns B and C are unwanted (False). They are removed when the mask is applied.
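To make the mask less opaque, here is a minimal sketch that prints the per-column NA counts the mask is derived from. It reuses the sample frame from the example; the names na_counts and mask are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

# Number of NA values in each column.
na_counts = df.isnull().sum()

# True for columns to keep: those with at most 2 NA values.
mask = ~na_counts.gt(2)

print(na_counts.to_dict())
print(mask.to_dict())
```

B has four NA values and C has three, so both exceed the limit of 2 and get False in the mask.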


I think the best option here is dropna with the thresh parameter:

thresh : int, optional
Require that many non-NA values.

So for a vectorized solution, subtract the allowed number of NA values from the length of the DataFrame:

    N = 2
    df = df.dropna(thresh=len(df)-N, axis=1)
    print (df)

       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
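The boundary case is worth checking: a column with exactly N NA values still has len(df) - N non-NA values, so it meets thresh and is kept. The sketch below illustrates this on a small made-up frame; drop_sparse_columns and the demo column names are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_na):
    """Hypothetical helper: drop columns with more than max_na NA values."""
    # thresh is the minimum number of non-NA values a column must have to survive.
    return frame.dropna(thresh=len(frame) - max_na, axis=1)

# Made-up data: columns with 1, 2 and 3 NA values out of 4 rows.
demo = pd.DataFrame({
    'one_na': [1.0, np.nan, 3.0, 4.0],
    'two_na': [np.nan, np.nan, 3.0, 4.0],
    'three_na': [np.nan, np.nan, np.nan, 4.0],
})

kept = drop_sparse_columns(demo, max_na=2)
print(kept.columns.tolist())
```

Only three_na is dropped, since it is the only column whose NA count exceeds 2.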

If you prefer a function, I suggest applying it to the input DataFrame with DataFrame.pipe, and changing df.column to df[column]. Dot notation fails when the column name is held in a variable, because it tries to select a column literally named column:

    df = pd.DataFrame({'A':list('abcdef'),
                       'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                       'C':[np.nan,8,np.nan,np.nan,2,3],
                       'D':[1,3,5,7,1,0],
                       'E':[5,3,6,9,2,np.nan],
                       'F':list('aaabbb')})
    print (df)

       A    B    C  D    E  F
    0  a  NaN  NaN  1  5.0  a
    1  b  NaN  8.0  3  3.0  a
    2  c  NaN  NaN  5  6.0  a
    3  d  5.0  NaN  7  9.0  b
    4  e  5.0  2.0  1  2.0  b
    5  f  NaN  3.0  0  NaN  b

    def check(df):
        for column in df:
            if df[column].isnull().sum() > 2:
                df.drop(column,axis=1, inplace=True)
        return df

    print (df.pipe(check))

       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b
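Note that check drops columns in place, so the original df is modified as well. As a possible variation (a sketch, not the answer's own code), the function can collect the offending column names first and return a new frame, with the limit passed through pipe. The name drop_columns_over_limit and its limit parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_columns_over_limit(frame, limit=2):
    """Hypothetical helper: return a new frame without columns holding more than `limit` NA values."""
    # Bracket access works with a column name stored in a variable.
    to_drop = [column for column in frame if frame[column].isnull().sum() > limit]
    return frame.drop(columns=to_drop)

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

# Extra keyword arguments to pipe are forwarded to the function.
result = df.pipe(drop_columns_over_limit, limit=2)

print(result.columns.tolist())
print(df.columns.tolist())  # the input frame keeps all six columns
```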


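Alternatively, you can use count, which counts the non-null values in each column. Keep the columns whose count is at least len(df.index) - 2, so that a column with exactly 2 NA values is still kept:

    In [23]: df.loc[:, df.count().ge(len(df.index) - 2)]
    Out[23]:
       A  D    E  F
    0  a  1  5.0  a
    1  b  3  3.0  a
    2  c  5  6.0  a
    3  d  7  9.0  b
    4  e  1  2.0  b
    5  f  0  NaN  b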
