propagate conditional column value in pandas
You can use
In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes')) .groupby(['Customer', 'Period'])['eq'] .transform('any').astype(int))In [955]: dfOut[955]: Customer Period Question Score Indicator0 A 1 foo 2 11 A 1 bar 3 12 A 1 baz yes 13 A 1 biz 1 14 B 1 bar 2 05 B 1 baz no 06 B 1 qux 3 07 A 2 foo 5 18 A 2 baz yes 19 B 2 baz yes 110 B 2 biz 2 1
Details
In [956]: df.Question.eq('baz') & df.Score.eq('yes')Out[956]:0 False1 False2 True3 False4 False5 False6 False7 False8 True9 True10 Falsedtype: boolIn [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))Out[957]: Customer Period Question Score Indicator eq0 A 1 foo 2 1 False1 A 1 bar 3 1 False2 A 1 baz yes 1 True3 A 1 biz 1 1 False4 B 1 bar 2 0 False5 B 1 baz no 0 False6 B 1 qux 3 0 False7 A 2 foo 5 1 False8 A 2 baz yes 1 True9 B 2 baz yes 1 True10 B 2 biz 2 1 False
Here's one way. The idea is to use a Boolean mask with MultiIndex
. Then use pd.Series.isin
to compare against your filtered indices.
mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')idx_cols = ['Customer', 'Period']idx = df.set_index(idx_cols).loc[mask.values].indexdf['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)print(df) Customer Period Question Score Indicator0 A 1 foo 2 11 A 1 bar 3 12 A 1 baz yes 13 A 1 biz 1 14 B 1 bar 2 05 B 1 baz no 06 B 1 qux 3 07 A 2 foo 5 18 A 2 baz yes 19 B 2 baz yes 110 B 2 biz 2 1
You can factorize the tuples of Customer
and Period
. Then use np.logical_or.at
to get group-wise any
i, r = pd.factorize([*zip(df.Customer, df.Period)])a = np.zeros(len(r), dtype=np.bool8)np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))df.assign(Indicator=a[i].astype(np.int64)) Customer Period Question Score Indicator0 A 1 foo 2 11 A 1 bar 3 12 A 1 baz yes 13 A 1 biz 1 14 B 1 bar 2 05 B 1 baz no 06 B 1 qux 3 07 A 2 foo 5 18 A 2 baz yes 19 B 2 baz yes 110 B 2 biz 2 1
Explanation
i, r = pd.factorize([*zip(df.Customer, df.Period)])
produces unique (Customer, Period)
pairs in r
where i
is an array keeping track of which element of r
went where in order to produce the original list of tuples
Original list of tuples
[*zip(df.Customer, df.Period)][('A', 1), ('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1), ('B', 1), ('A', 2), ('A', 2), ('B', 2), ('B', 2)]
After factorizing, unique tuples
r
rarray([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)
And the positions
i
iarray([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
I can now use i
as indices for evaluating grouped any
in Numpy using Numpy's at
method on ufuncs
. Basically, this allows me to create an array upfront whose values may change based on my at
operation. Then specify an array of indices (that's what i
will be) and an array matching the size of i
that is the second part of my operation at that index.
I end up using as my matching array
df.eval('Question == "baz" and Score == "yes"')0 False1 False2 True3 False4 False5 False6 False7 False8 True9 True10 Falsedtype: bool
Let me show this in painstaking detail
Flag GroupIndex Group State of a0 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing1 False 0 (A, 1) [0, 0, 0, 0] # Flag is False, So do Nothing2 True 0 (A, 1) [1, 0, 0, 0] # Flag is True, or_eq for Index 03 False 0 (A, 1) [1, 0, 0, 0] # Flag is False, So do Nothing4 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing5 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing6 False 1 (B, 1) [1, 0, 0, 0] # Flag is False, So do Nothing7 False 2 (A, 2) [1, 0, 0, 0] # Flag is False, So do Nothing8 True 2 (A, 2) [1, 0, 1, 0] # Flag is True, or_eq for Index 29 True 3 (B, 2) [1, 0, 1, 1] # Flag is True, or_eq for Index 310 False 3 (B, 2) [1, 0, 1, 1] # Flag is False, So do Nothing
The final State
is [1, 0, 1, 1]
or in boolean terms [True, False, True, True]
. And that represents the or
accumulation within each unique group that is housed in a
aarray([ True, False, True, True])
If I slice this with the index positions in i
and cast as integers, I get
a[i].astype(np.int64)array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])
Which is precisely what we were looking for.
Finally, I use assign
to produce a copy of the dataframe with its new column.
df.assign(Indicator=a[i].astype(np.int64)) Customer Period Question Score Indicator0 A 1 foo 2 11 A 1 bar 3 12 A 1 baz yes 13 A 1 biz 1 14 B 1 bar 2 05 B 1 baz no 06 B 1 qux 3 07 A 2 foo 5 18 A 2 baz yes 19 B 2 baz yes 110 B 2 biz 2 1
Why Do it This Way?!
Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)
i, r = pd.factorize([*zip(df.Customer, df.Period)])a = np.zeros(len(r), dtype=np.bool8)q = df.Question.values == 'baz's = df.Score.values == 'yes'm = q & snp.logical_or.at(a, i, m)df.assign(Indicator=a[i].astype(np.int64))