Pandas DataFrames with NaNs equality comparison


You can use assert_frame_equal with check_names=False (so as not to check the index/columns names), which will raise an AssertionError if the frames are not equal:

    In [11]: from pandas.testing import assert_frame_equal
    In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

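    from pandas.testing import assert_frame_equal

    # frames_equal is a hypothetical helper name
    def frames_equal(df, expected):
        try:
            assert_frame_equal(df, expected, check_names=False)
            return True
        except AssertionError:
            return False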

In more recent versions of pandas this functionality is available as DataFrame.equals, which treats NaNs in the same location as equal:

df.equals(expected)
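A minimal sketch of how .equals behaves with NaNs, compared with a plain elementwise ==. The sample frames and variable names here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Two frames with NaNs in the same positions (illustrative data).
df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})
expected = df.copy()

# Elementwise == treats NaN as unequal to itself, so reducing with
# .all() reports a mismatch even though the frames are copies.
elementwise_all = bool((df == expected).to_numpy().all())

# DataFrame.equals considers NaNs in the same location equal.
same = bool(df.equals(expected))

# A NaN that appears in only one of the frames still makes them unequal.
other = expected.copy()
other.loc[0, "a"] = np.nan
different = bool(df.equals(other))

print(elementwise_all, same, different)
```

Note that .equals also compares column dtypes, so an integer column and a float column holding the same values are not considered equal.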


One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.
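As a small illustration, the expression can be applied directly to two DataFrames and then reduced to a single bool. The helper name nan_aware_eq and the sample data are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd

def nan_aware_eq(a, b):
    """Elementwise mask: True where a == b, or where both a and b are NaN."""
    # a != a is True only where a is NaN, since NaN is unequal to itself.
    return (a == b) | ((a != a) & (b != b))

left = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})
same = left.copy()
changed = left.copy()
changed.loc[2, "x"] = np.nan  # NaN present in only one of the frames

mask_same = nan_aware_eq(left, same)
mask_changed = nan_aware_eq(left, changed)

# Reduce each mask to a single bool.
all_same = bool(mask_same.to_numpy().all())
all_changed = bool(mask_changed.to_numpy().all())
print(all_same, all_changed)
```

The two frames need identical labels for the elementwise comparison to work; pandas raises a ValueError otherwise.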

For small frames, assert_frame_equal is fine. For large frames (10M rows), however, it is pretty much useless: it took so long that I had to interrupt it.

    In [1]: df = DataFrame(rand(1e7, 15))
    In [2]: df = df[df > 0.5]
    In [3]: df2 = df.copy()
    In [4]: df
    Out[4]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 10000000 entries, 0 to 9999999
    Columns: 15 entries, 0 to 14
    dtypes: float64(15)

    In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
    1 loops, best of 3: 598 ms per loop

Timing for the (presumably) desired single bool indicating whether the two DataFrames are equal:

    In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
    1 loops, best of 3: 687 ms per loop


Like @PhillipCloud's answer, but written out more fully:

    In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
    In [27]: df2 = df1.copy()

They really are equivalent:

    In [28]: result = df1 == df2
    In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
    In [30]: result
    Out[30]:
          0     1
    0  True  True
    1  True  True

A NaN in df2 that doesn't exist in df1:

    In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
    In [32]: result = df1 == df2
    In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
    In [34]: result
    Out[34]:
           0     1
    0   True  True
    1  False  True

You can also fill with a value you know not to be in the frame:

    In [38]: df1.fillna(-999) == df1.fillna(-999)
    Out[38]:
          0     1
    0  True  True
    1  True  True
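A hedged sketch of the same sentinel idea applied to two different frames. The helper equal_with_sentinel, its guard against the sentinel already being present, and the default of -999 are illustrative assumptions, not part of pandas:

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # assumed not to occur as real data in either frame

def equal_with_sentinel(a, b, sentinel=SENTINEL):
    """Elementwise equality after replacing NaN with a sentinel value."""
    # The trick only works if the sentinel is not already real data.
    if (a == sentinel).to_numpy().any() or (b == sentinel).to_numpy().any():
        raise ValueError("sentinel value already present in the data")
    return a.fillna(sentinel) == b.fillna(sentinel)

df1 = pd.DataFrame([[np.nan, 1], [2, np.nan]])
df2 = pd.DataFrame([[np.nan, 1], [np.nan, np.nan]])

mask = equal_with_sentinel(df1, df2)
print(mask)

# The guard rejects a frame that already contains the sentinel.
try:
    equal_with_sentinel(pd.DataFrame([[-999.0]]), pd.DataFrame([[np.nan]]))
    guard_raised = False
except ValueError:
    guard_raised = True
```

Only the cell where df2 has a NaN and df1 does not comes out False; cells where both frames hold NaN compare equal after filling.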