propagate conditional column value in pandas propagate conditional column value in pandas pandas pandas

propagate conditional column value in pandas


You can use

In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))                               .groupby(['Customer', 'Period'])['eq']                               .transform('any').astype(int))In [955]: dfOut[955]:   Customer  Period Question Score  Indicator0         A       1      foo     2          11         A       1      bar     3          12         A       1      baz   yes          13         A       1      biz     1          14         B       1      bar     2          05         B       1      baz    no          06         B       1      qux     3          07         A       2      foo     5          18         A       2      baz   yes          19         B       2      baz   yes          110        B       2      biz     2          1

Details

In [956]: df.Question.eq('baz') & df.Score.eq('yes')Out[956]:0     False1     False2      True3     False4     False5     False6     False7     False8      True9      True10    Falsedtype: boolIn [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))Out[957]:   Customer  Period Question Score  Indicator     eq0         A       1      foo     2          1  False1         A       1      bar     3          1  False2         A       1      baz   yes          1   True3         A       1      biz     1          1  False4         B       1      bar     2          0  False5         B       1      baz    no          0  False6         B       1      qux     3          0  False7         A       2      foo     5          1  False8         A       2      baz   yes          1   True9         B       2      baz   yes          1   True10        B       2      biz     2          1  False


Here's one way. The idea is to use a Boolean mask with MultiIndex. Then use pd.Series.isin to compare against your filtered indices.

mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')idx_cols = ['Customer', 'Period']idx = df.set_index(idx_cols).loc[mask.values].indexdf['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)print(df)   Customer  Period Question Score  Indicator0         A       1      foo     2          11         A       1      bar     3          12         A       1      baz   yes          13         A       1      biz     1          14         B       1      bar     2          05         B       1      baz    no          06         B       1      qux     3          07         A       2      foo     5          18         A       2      baz   yes          19         B       2      baz   yes          110        B       2      biz     2          1


You can factorize the tuples of Customer and Period. Then use np.logical_or.at to get group-wise any

i, r = pd.factorize([*zip(df.Customer, df.Period)])a = np.zeros(len(r), dtype=np.bool8)np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))df.assign(Indicator=a[i].astype(np.int64))   Customer  Period Question Score  Indicator0         A       1      foo     2          11         A       1      bar     3          12         A       1      baz   yes          13         A       1      biz     1          14         B       1      bar     2          05         B       1      baz    no          06         B       1      qux     3          07         A       2      foo     5          18         A       2      baz   yes          19         B       2      baz   yes          110        B       2      biz     2          1

Explanation

i, r = pd.factorize([*zip(df.Customer, df.Period)])

produces unique (Customer, Period) pairs in r where i is an array keeping track of which element of r went where in order to produce the original list of tuples

  1. Original list of tuples

    [*zip(df.Customer, df.Period)][('A', 1), ('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1), ('B', 1), ('A', 2), ('A', 2), ('B', 2), ('B', 2)]
  2. After factorizing, unique tuples r

    rarray([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)
  3. And the positions i

    iarray([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])

I can now use i as indices for evaluating grouped any in Numpy using Numpy's at method on ufuncs. Basically, this allows me to create an array upfront whose values may change based on my at operation. Then specify an array of indices (that's what i will be) and an array matching the size of i that is the second part of my operation at that index.

I end up using as my matching array

df.eval('Question == "baz" and Score == "yes"')0     False1     False2      True3     False4     False5     False6     False7     False8      True9      True10    Falsedtype: bool

Let me show this in painstaking detail

     Flag  GroupIndex   Group    State of a0   False           0  (A, 1)  [0, 0, 0, 0]  # Flag is False, So do Nothing1   False           0  (A, 1)  [0, 0, 0, 0]  # Flag is False, So do Nothing2    True           0  (A, 1)  [1, 0, 0, 0]  # Flag is True, or_eq for Index 03   False           0  (A, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing4   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing5   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing6   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing7   False           2  (A, 2)  [1, 0, 0, 0]  # Flag is False, So do Nothing8    True           2  (A, 2)  [1, 0, 1, 0]  # Flag is True, or_eq for Index 29    True           3  (B, 2)  [1, 0, 1, 1]  # Flag is True, or_eq for Index 310  False           3  (B, 2)  [1, 0, 1, 1]  # Flag is False, So do Nothing

The final State is [1, 0, 1, 1] or in boolean terms [True, False, True, True]. And that represents the or accumulation within each unique group that is housed in a

aarray([ True, False,  True,  True])

If I slice this with the index positions in i and cast as integers, I get

a[i].astype(np.int64)array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])

Which is precisely what we were looking for.

Finally, I use assign to produce a copy of the dataframe with its new column.

df.assign(Indicator=a[i].astype(np.int64))   Customer  Period Question Score  Indicator0         A       1      foo     2          11         A       1      bar     3          12         A       1      baz   yes          13         A       1      biz     1          14         B       1      bar     2          05         B       1      baz    no          06         B       1      qux     3          07         A       2      foo     5          18         A       2      baz   yes          19         B       2      baz   yes          110        B       2      biz     2          1

Why Do it This Way?!

Numpy is often faster.
Below is a slightly more optimized approach. (basically the same)

i, r = pd.factorize([*zip(df.Customer, df.Period)])a = np.zeros(len(r), dtype=np.bool8)q = df.Question.values == 'baz's = df.Score.values == 'yes'm = q & snp.logical_or.at(a, i, m)df.assign(Indicator=a[i].astype(np.int64))