Why doesn't "is not None" work with dataframe.loc, but "!= None" works fine?


Instead of

df_ = df.loc[df['entities'] is not None]

or

df_ = df.loc[df['entities'] != None]

you should rather use

df_ = df.loc[df['entities'].notna()]

The reason is that pandas represents missing values differently from plain Python, which uses None. You get the KeyError because df['entities'] is not None checks whether the column Series itself is the object None. That is always True, because a Series is never None. .loc then looks up the label True in the row index, which is not present in your case, so it raises the exception.

!= doesn't raise this exception, because pandas.Series overloads the comparison operators (otherwise you couldn't build indexers by comparing a column with a fixed value, as in df['name'] == 'Miller'). The overloaded method performs an elementwise comparison and returns a boolean indexer that works fine with .loc. The result just might not be what you intended.
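To make the difference concrete, here is a minimal sketch of the three expressions side by side. The DataFrame and its contents are made up for illustration (the question's real data is not shown), and the exact KeyError message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical data: the 'entities' column holds lists, with one missing entry.
df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "entities": [["Berlin"], None, ["Paris", "Rome"]],
})

# Identity test on the Series object itself: yields a plain Python bool.
identity_check = df["entities"] is not None

# Overloaded operator: elementwise comparison, yields a boolean Series.
elementwise = df["entities"] != None  # noqa: E711

# Missing-value aware mask, which is what the question actually wants.
mask = df["entities"].notna()

# Passing the plain bool to .loc is treated as a label lookup and fails here.
try:
    df.loc[identity_check]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(identity_check)   # a single bool, not a mask
print(elementwise)      # a boolean Series; inspect it before trusting it as a filter
print(mask.tolist())
print(lookup_failed)
```

Only the notna() mask is guaranteed to mark the missing row; the != None result is a valid indexer, but its values depend on how pandas treats comparisons against None.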

E.g. if you do

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(x=[1, 2, 3], y=list('abc'), nulls=[None, np.nan, np.float32('inf')]))
df['nulls'].isna()

It returns:

Out[18]:
0     True
1     True
2    False
Name: nulls, dtype: bool

but the code:

df['nulls'] == None

returns

Out[20]:
0    False
1    False
2    False
Name: nulls, dtype: bool

If you look at the datatype of the objects stored in the column, you see they are all floats:

df['nulls'].map(type)
Out[19]:
0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
Name: nulls, dtype: object

For columns of other types the representation of missing values might even be different. E.g. if you use Int64 columns it looks like this:

df['nulls_int64'] = pd.Series([None, 1, 2], dtype='Int64')
df['nulls_int64'].map(type)
Out[26]:
0    <class 'float'>
1      <class 'int'>
2      <class 'int'>
Name: nulls_int64, dtype: object

So using isna() (or notna()) instead of comparing with None also keeps your code free of pandas-internal details about how missing values are represented.
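As a small sketch of that point, the same notna() filter works on a float column and on an Int64 column without caring how each one stores its missing values. The column names here are invented for the example.

```python
import numpy as np
import pandas as pd

# Two hypothetical columns with different dtypes and different missing-value storage.
df = pd.DataFrame({
    "as_float": [None, np.nan, 1.5],
    "as_int64": pd.Series([None, 1, 2], dtype="Int64"),
})

# The same call filters both columns; no dtype-specific handling is needed.
kept_float = df.loc[df["as_float"].notna()]
kept_int64 = df.loc[df["as_int64"].notna()]

print(kept_float.index.tolist())  # rows where as_float is present
print(kept_int64.index.tolist())  # rows where as_int64 is present
```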


I'm somewhat going out on a limb here, since I don't have experience with Pandas, but with Python…

Pandas' magic filtering through [] relies heavily on operator overloading. In this expression:

df.loc[df['entities'] != None]

df['entities'] is an object that implements the __ne__ method, which means you're essentially doing:

df.loc[df['entities'].__ne__(None)]

df['entities'].__ne__(None) produces a new object that represents the per-row condition. The df.loc object implements the __getitem__ method to overload the [] subscript syntax, so the whole thing is essentially:

df.loc.__getitem__(df['entities'].__ne__(None))

On the other hand, the is operator is not overloadable. There's no __is__ method an object could implement, so df['entities'] is not None is evaluated by Python's core rules as a plain identity test, and since df['entities'] really is not None, the result of that expression is True. So you end up with just:

df.loc.__getitem__(True)

And that's why the error message is KeyError: True.
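The mechanism can be reproduced without pandas at all. The sketch below uses two made-up classes, Column and Loc, as toy stand-ins; they are not how pandas is implemented, and the toy compares elements with plain Python semantics, so its treatment of None differs from pandas.

```python
class Column:
    """Hypothetical stand-in for a column object; not pandas code."""

    def __init__(self, values):
        self.values = list(values)

    def __ne__(self, other):
        # Called for `column != other`; returns one flag per element.
        return [value != other for value in self.values]


class Loc:
    """Hypothetical stand-in for an indexer that overloads []."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, key):
        if isinstance(key, list):
            # A list of flags selects the matching rows.
            return [row for row, keep in zip(self.rows, key) if keep]
        # Anything else is treated as a label, and True is not a label here.
        raise KeyError(key)


column = Column(["x", None, "y"])
loc = Loc(["row0", "row1", "row2"])

selected = loc[column != None]    # Column.__ne__ runs, then Loc.__getitem__
plain_bool = column is not None   # no hook exists: plain identity test

try:
    loc[plain_bool]
    error = None
except KeyError as exc:
    error = repr(exc)

print(selected)    # ['row0', 'row2']
print(plain_bool)  # True
print(error)       # KeyError(True)
```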