Why doesn't "is not None" work with dataframe.loc, but "!= None" works fine?
Instead of

df_ = df.loc[df['entities'] is not None]

or

df_ = df.loc[df['entities'] != None]

you should rather use

df_ = df.loc[df['entities'].notna()]

(or .isna() if you want the rows with missing values).
This is because pandas represents missing values differently from the usual Python way of representing them by None. In particular, you get the KeyError because the column series df['entities'] is checked for identity with None. This evaluates to True in any case, because the series object itself is not None. Then .loc searches the row index for the label True, which is not present in your case, so it raises the exception.

!= doesn't cause this exception, because the inequality operator is overloaded by pandas.Series (otherwise you couldn't build indexers by comparing a column with a fixed value, as in df['name'] == 'Miller'). This overloaded method performs an elementwise comparison and itself returns a boolean indexer that works fine with the .loc method. The result just might not be what you intended.
E.g. if you do

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(x=[1, 2, 3], y=list('abc'), nulls=[None, np.nan, np.float32('inf')]))
df['nulls'].isna()

it returns:

Out[18]:
0     True
1     True
2    False
Name: nulls, dtype: bool
but the code:
df['nulls'] == None
returns
Out[20]:
0    False
1    False
2    False
Name: nulls, dtype: bool
If you look at the datatype of the objects stored in the column, you see they are all floats:
df['nulls'].map(type)

Out[19]:
0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
Name: nulls, dtype: object
For columns of other types, the representation of missing values might even be different. E.g. if you use Int64 columns, it looks like this:

df['nulls_int64'] = pd.Series([None, 1, 2], dtype='Int64')
df['nulls_int64'].map(type)

Out[26]:
0    <class 'float'>
1      <class 'int'>
2      <class 'int'>
Name: nulls_int64, dtype: object
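As a quick check (a sketch, assuming a pandas version that supports the Int64 extension dtype), isna() gives a consistent answer regardless of how the missing value is stored internally:

```python
import pandas as pd
import numpy as np

s_float = pd.Series([None, np.nan, 1.0])          # missing stored as NaN (float dtype)
s_int64 = pd.Series([None, 1, 2], dtype='Int64')  # missing stored as pandas' NA

# isna() abstracts over the internal representation of missing values
print(s_float.isna().tolist())  # [True, True, False]
print(s_int64.isna().tolist())  # [True, False, False]
```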
So using notna() (or isna()) instead of != None also helps keep your code clean of handling pandas-internal data representations.
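To make the difference concrete, here is a small sketch (the column name and data are made up for illustration):

```python
import pandas as pd

# toy frame; None in a numeric column is stored as NaN
df = pd.DataFrame({'entities': [1.0, None, 3.0]})

mask_ne = df['entities'] != None     # noqa: E711 -- elementwise; NaN != None is True, so every row passes
mask_notna = df['entities'].notna()  # flags exactly the non-missing rows

filtered = df.loc[mask_notna]        # keeps only the rows with actual values
```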
I'm somewhat going out on a limb here, since I don't have experience with Pandas, but with Python…
Pandas' magic filtering through [] is based a lot on operator overloading. In this expression:

df.loc[df['entities'] != None]

df['entities'] is an object which implements the __ne__ method, which means you're essentially doing:

df.loc[df['entities'].__ne__(None)]

df['entities'].__ne__(None) produces some new magic conditions object. The df.loc object implements the __getitem__ method to overload the [] subscript syntax, so the whole thing is essentially:

df.loc.__getitem__(df['entities'].__ne__(None))
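You can verify this desugaring directly (a small sketch; the series values are arbitrary):

```python
import pandas as pd

s = pd.Series([1.0, None])  # None becomes NaN in this float column

# the operator form and the explicit dunder call build the same boolean mask
m1 = s != None  # noqa: E711
m2 = s.__ne__(None)

same = m1.equals(m2)  # True: both are the elementwise comparison result
```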
On the other hand, the is operator is not overloadable. There's no __is__ method an object could implement, so df['entities'] is not None is evaluated as-is by Python's core rules, and since df['entities'] really is not None, the result of that expression is the plain bool True. So the whole thing is just:

df.loc.__getitem__(True)

And that's why the error message complains about KeyError: True.