propagate conditional column value in pandas

python pandas

You can use groupby with transform('any'):

In [954]: df['Indicator'] = (df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
                               .groupby(['Customer', 'Period'])['eq']
                               .transform('any').astype(int))

In [955]: df
Out[955]:
   Customer  Period Question Score  Indicator
0         A       1      foo     2          1
1         A       1      bar     3          1
2         A       1      baz   yes          1
3         A       1      biz     1          1
4         B       1      bar     2          0
5         B       1      baz    no          0
6         B       1      qux     3          0
7         A       2      foo     5          1
8         A       2      baz   yes          1
9         B       2      baz   yes          1
10        B       2      biz     2          1

Details

In [956]: df.Question.eq('baz') & df.Score.eq('yes')
Out[956]:
0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8      True
9      True
10    False
dtype: bool

In [957]: df.assign(eq=df.Question.eq('baz') & df.Score.eq('yes'))
Out[957]:
   Customer  Period Question Score  Indicator     eq
0         A       1      foo     2          1  False
1         A       1      bar     3          1  False
2         A       1      baz   yes          1   True
3         A       1      biz     1          1  False
4         B       1      bar     2          0  False
5         B       1      baz    no          0  False
6         B       1      qux     3          0  False
7         A       2      foo     5          1  False
8         A       2      baz   yes          1   True
9         B       2      baz   yes          1   True
10        B       2      biz     2          1  False
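For anyone who wants to run this outside the IPython session, here is a minimal self-contained sketch of the same groupby/transform idea. The sample frame is rebuilt from the printed output above, and it assumes the Score column holds strings (the original dtype is not shown).

```python
import pandas as pd

# Sample data rebuilt from the printed output; Score is assumed to be strings.
df = pd.DataFrame({
    "Customer": list("AAAABBBAABB"),
    "Period": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Question": ["foo", "bar", "baz", "biz", "bar", "baz", "qux",
                 "foo", "baz", "baz", "biz"],
    "Score": ["2", "3", "yes", "1", "2", "no", "3", "5", "yes", "yes", "2"],
})

# Row-level condition: the question is 'baz' and it was answered 'yes'.
hit = df["Question"].eq("baz") & df["Score"].eq("yes")

# Broadcast "any row in this (Customer, Period) group matched" back to every row.
df["Indicator"] = (
    hit.groupby([df["Customer"], df["Period"]]).transform("any").astype(int)
)

print(df)
```

Grouping the boolean Series directly avoids the temporary eq column, but the result should match the assign-based version.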


Here's one way. The idea is to use a Boolean mask with MultiIndex. Then use pd.Series.isin to compare against your filtered indices.

mask = (df['Question'] == 'baz') & (df['Score'] == 'yes')
idx_cols = ['Customer', 'Period']
idx = df.set_index(idx_cols).loc[mask.values].index
df['Indicator'] = pd.Series(df.set_index(idx_cols).index.values).isin(idx).astype(int)

print(df)

   Customer  Period Question Score  Indicator
0         A       1      foo     2          1
1         A       1      bar     3          1
2         A       1      baz   yes          1
3         A       1      biz     1          1
4         B       1      bar     2          0
5         B       1      baz    no          0
6         B       1      qux     3          0
7         A       2      foo     5          1
8         A       2      baz   yes          1
9         B       2      baz   yes          1
10        B       2      biz     2          1
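A compact variant of the same idea, offered as a sketch rather than the exact code above: it builds the key index once with pd.MultiIndex.from_frame and calls MultiIndex.isin directly, instead of wrapping the index values in a Series. The sample frame is rebuilt from the printed output, with Score assumed to hold strings.

```python
import pandas as pd

# Sample data rebuilt from the printed output; Score is assumed to be strings.
df = pd.DataFrame({
    "Customer": list("AAAABBBAABB"),
    "Period": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Question": ["foo", "bar", "baz", "biz", "bar", "baz", "qux",
                 "foo", "baz", "baz", "biz"],
    "Score": ["2", "3", "yes", "1", "2", "no", "3", "5", "yes", "yes", "2"],
})

mask = (df["Question"] == "baz") & (df["Score"] == "yes")
idx_cols = ["Customer", "Period"]

# One (Customer, Period) key per row, in row order.
keys = pd.MultiIndex.from_frame(df[idx_cols])

# Keys of the rows that satisfy the condition.
matching = keys[mask.to_numpy()]

# A row gets 1 when its key appears among the matching keys.
df["Indicator"] = keys.isin(matching).astype(int)

print(df)
```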


You can factorize the tuples of Customer and Period, then use np.logical_or.at to get a group-wise any.

i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=bool)
np.logical_or.at(a, i, df.eval('Question == "baz" and Score == "yes"'))

df.assign(Indicator=a[i].astype(np.int64))

   Customer  Period Question Score  Indicator
0         A       1      foo     2          1
1         A       1      bar     3          1
2         A       1      baz   yes          1
3         A       1      biz     1          1
4         B       1      bar     2          0
5         B       1      baz    no          0
6         B       1      qux     3          0
7         A       2      foo     5          1
8         A       2      baz   yes          1
9         B       2      baz   yes          1
10        B       2      biz     2          1

Explanation

i, r = pd.factorize([*zip(df.Customer, df.Period)])

This produces the unique (Customer, Period) pairs in r, while i is an array recording which element of r each row maps to, so that r[i] reproduces the original list of tuples.

Original list of tuples

[*zip(df.Customer, df.Period)]

[('A', 1), ('A', 1), ('A', 1), ('A', 1), ('B', 1), ('B', 1), ('B', 1), ('A', 2), ('A', 2), ('B', 2), ('B', 2)]

After factorizing, unique tuples r

r

array([('A', 1), ('B', 1), ('A', 2), ('B', 2)], dtype=object)

And the positions i

i

array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])

I can now use i as indices for evaluating a grouped any in NumPy, via the at method that NumPy ufuncs provide. at takes an array that I create up front and that is modified in place, an array of indices (that's what i will be), and an array matching the size of i that supplies the second operand of the operation at each index. Unlike ordinary fancy-indexed assignment, at applies the operation once per entry even when an index repeats.

The matching array I end up using is

df.eval('Question == "baz" and Score == "yes"')

0     False
1     False
2      True
3     False
4     False
5     False
6     False
7     False
8      True
9      True
10    False
dtype: bool

Let me show this in painstaking detail:

     Flag  GroupIndex   Group    State of a
0   False           0  (A, 1)  [0, 0, 0, 0]  # Flag is False, So do Nothing
1   False           0  (A, 1)  [0, 0, 0, 0]  # Flag is False, So do Nothing
2    True           0  (A, 1)  [1, 0, 0, 0]  # Flag is True, or_eq for Index 0
3   False           0  (A, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing
4   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing
5   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing
6   False           1  (B, 1)  [1, 0, 0, 0]  # Flag is False, So do Nothing
7   False           2  (A, 2)  [1, 0, 0, 0]  # Flag is False, So do Nothing
8    True           2  (A, 2)  [1, 0, 1, 0]  # Flag is True, or_eq for Index 2
9    True           3  (B, 2)  [1, 0, 1, 1]  # Flag is True, or_eq for Index 3
10  False           3  (B, 2)  [1, 0, 1, 1]  # Flag is False, So do Nothing

The final state is [1, 0, 1, 1], or in Boolean terms [True, False, True, True]. That is the or accumulation within each unique group, and it is stored in a:

a

array([ True, False,  True,  True])
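The accumulation traced above can be reproduced in isolation. This is a small sketch that hard-codes i and the flags from the printed output (so it needs no DataFrame) and compares np.logical_or.at with an explicit Python loop performing the same per-row update; the name a_loop is only for illustration.

```python
import numpy as np

# Group positions and row-level flags, copied from the output shown above.
i = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
flags = np.array([False, False, True, False, False, False,
                  False, False, True, True, False])

# One slot per unique group, all False to start with.
a = np.zeros(4, dtype=bool)

# Unbuffered in-place OR: every (index, flag) pair is applied,
# even though the indices repeat.
np.logical_or.at(a, i, flags)

# The same accumulation written as an explicit loop, for comparison.
a_loop = np.zeros(4, dtype=bool)
for group, flag in zip(i, flags):
    a_loop[group] |= flag

print(a)
print(a_loop)
```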

If I slice a with the index positions in i and cast to integers, I get

a[i].astype(np.int64)

array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1])

Which is precisely what we were looking for.

Finally, I use assign to produce a copy of the dataframe with its new column.

df.assign(Indicator=a[i].astype(np.int64))

   Customer  Period Question Score  Indicator
0         A       1      foo     2          1
1         A       1      bar     3          1
2         A       1      baz   yes          1
3         A       1      biz     1          1
4         B       1      bar     2          0
5         B       1      baz    no          0
6         B       1      qux     3          0
7         A       2      foo     5          1
8         A       2      baz   yes          1
9         B       2      baz   yes          1
10        B       2      biz     2          1

Why Do it This Way?!

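NumPy is often faster than the equivalent pandas operations.
Below is a slightly more optimized version of the same approach. It builds the mask from the underlying arrays instead of calling df.eval.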

i, r = pd.factorize([*zip(df.Customer, df.Period)])
a = np.zeros(len(r), dtype=bool)
q = df.Question.values == 'baz'
s = df.Score.values == 'yes'
m = q & s
np.logical_or.at(a, i, m)

df.assign(Indicator=a[i].astype(np.int64))
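To reuse the NumPy approach and check it against the pandas one, it can be wrapped in a helper. This is a sketch under stated assumptions: group_any_indicator is a hypothetical name, the group codes come from groupby(...).ngroup() with sort=False rather than from pd.factorize on a list of tuples (a swap made only to keep the example short), the frame is non-empty with no missing keys, and Score is assumed to hold strings.

```python
import numpy as np
import pandas as pd


def group_any_indicator(df, keys, flags):
    """Return 1 for every row whose group contains at least one True flag.

    Hypothetical helper; assumes a non-empty frame with no missing keys.
    """
    # Integer code per row, numbered in order of first appearance.
    codes = df.groupby(keys, sort=False).ngroup().to_numpy()
    hit = np.zeros(codes.max() + 1, dtype=bool)
    # Unbuffered OR accumulates flags per group despite repeated codes.
    np.logical_or.at(hit, codes, np.asarray(flags, dtype=bool))
    return hit[codes].astype(np.int64)


# Sample data rebuilt from the printed output; Score is assumed to be strings.
df = pd.DataFrame({
    "Customer": list("AAAABBBAABB"),
    "Period": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Question": ["foo", "bar", "baz", "biz", "bar", "baz", "qux",
                 "foo", "baz", "baz", "biz"],
    "Score": ["2", "3", "yes", "1", "2", "no", "3", "5", "yes", "yes", "2"],
})

keys = ["Customer", "Period"]
flags = (df["Question"].to_numpy() == "baz") & (df["Score"].to_numpy() == "yes")

result = df.assign(Indicator=group_any_indicator(df, keys, flags))

# Reference result from the groupby/transform approach, for comparison.
reference = df.assign(eq=flags).groupby(keys)["eq"].transform("any").astype(int)

print(result)
```

Whether this beats groupby/transform depends on the data size and library versions, so it is worth timing on your own frame before committing to it.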
