Compare a matrix against a column vector
Consider the pd.DataFrame
and pd.Series
, A
and B
A = pd.DataFrame([ [28, 39, 52], [77, 80, 66], [7, 18, 24], [9, 97, 68] ])B = pd.Series([32, 5, 42, 17])
pandas
By default, when you compare a pd.DataFrame
with a pd.Series
, pandas aligns each index value from the series with the column names of the dataframe. This is what happens when you use A < B
. In this case, you have 4 rows in your dataframe and 4 elements in your series, so I'm going to assume you want to align the index values of the series with the index values of the dataframe. In order to specify the axis you want to align with, you need to use the comparison method rather than the operator. That's because when you use the method, you can use the axis
parameter and specify that you want axis=0
rather than the default axis=1
.
A.lt(B, axis=0) 0 1 20 True False False1 False False False2 True True True3 True False False
I often just write this as A.lt(B, 0)
numpy
In numpy, you also have to pay attention to the dimensionality of the arrays and you are assuming that the positions are already lined up. The positions will be taken care of if they come from the same dataframe.
print(A.values)[[28 39 52] [77 80 66] [ 7 18 24] [ 9 97 68]]print(B.values)[32 5 42 17]
Notice that B
is a 1 dimensional array while A is a 2 dimensional array. In order to compare B
along the rows of A
we need to reshape B
into a 2 dimensional array. The most obvious way to do this is with reshape
print(A.values < B.values.reshape(4, 1))[[ True False False] [False False False] [ True True True] [ True False False]]
However, these are ways in which you will commonly see others do the same reshaping
A.values < B.values.reshape(-1, 1)
Or
A.values < B.values[:, None]
timed back test
To get a handle of how fast these comparisons are, I've constructed the following back test.
def pd_cmp(df, s): return df.lt(s, 0)def np_cmp_a2a(df, s): """To get an apples to apples comparison I return the same thing in both functions""" return pd.DataFrame( df.values < s.values[:, None], df.index, df.columns )def np_cmp_a2o(df, s): """To get an apples to oranges comparison I return a numpy array""" return df.values < s.values[:, None]results = pd.DataFrame( index=pd.Index([10, 1000, 100000], name='group size'), columns=pd.Index(['pd_cmp', 'np_cmp_a2a', 'np_cmp_a2o'], name='method'),)from timeit import timeitfor i in results.index: df = pd.concat([A] * i, ignore_index=True) s = pd.concat([B] * i, ignore_index=True) for j in results.columns: results.set_value( i, j, timeit( '{}(df, s)'.format(j), 'from __main__ import {}, df, s'.format(j), number=100 ) )results.plot()
I can conclude that the numpy
based solutions are faster but not all that much. They all scale the same.
You can do this using lt
and calling squeeze
on B
so it flattens the df to a 1-D Series:
In [107]:A.lt(B.squeeze(),axis=0)Out[107]: 0 1 20 True False False1 False False False2 True True True3 True False False
The problem is that without squeeze
then it will try to align on the column labels which we don't want. We want to broadcast the comparison along the column-axis
The more efficient is to go down numpy level (A,B are DataFrames here):
A.values<B.values