Identify consecutive same values in Pandas Dataframe, with a Groupby

python pandas numpy lambda

You can try this; 1) Create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare with three, then you get the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) df

Break downs:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition True whenever there is a value change:

df.value.diff().ne(0)#0      True#1     False#2      True#3      True#4     False#5     False#6      True#7     False#8     False#9     False#10     True#11     True#12     True#13    False#14    False#15     True#16    False#17     True#18    False#19    False#20    False#21    False#Name: value, dtype: bool

2) Then cumsum gives a non descending sequence of ids where each id denotes a consecutive chunk with same values, note when summing boolean values, True is considered as one while False is considered as zero:

df.value.diff().ne(0).cumsum()#0     1#1     1#2     2#3     3#4     3#5     3#6     4#7     4#8     4#9     4#10    5#11    6#12    7#13    7#14    7#15    8#16    8#17    9#18    9#19    9#20    9#21    9#Name: value, dtype: int64

3) combined with id column, you can group the data frame, calculate the group size and get the flag column.

python pandas numpy lambda

See EDIT2 for a more robust solution

Same result, but a little bit faster:

labels = (df.value != df.value.shift()).cumsum()df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)    id  value  flag0    1      2     01    1      2     02    1      3     03    1      2     14    1      2     15    1      2     16    1      3     17    1      3     18    1      3     19    1      3     110   2      1     011   2      4     012   2      1     113   2      1     114   2      1     115   2      4     016   2      4     017   2      1     118   2      1     119   2      1     120   2      1     121   2      1     1

Where:

df.value != df.value.shift() gives the value change
cumsum() creates "labels" for each group of same value
labels.value_counts() counts the occurrences of each label
labels.map(...) replaces labels by the counts computed above
>= 3 creates a boolean mask on count value
astype(int) casts the booleans to int

In my hands it give 1.03ms on your df, compared to 2.1ms for Psidoms' approach.But mine is not one-liner.

EDIT:

A mix between both approaches is even faster

labels = df.value.diff().ne(0).cumsum()df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

Gives 911µs with your sample df.

EDIT2: correct solution to account for id change, as pointed by @clg4

labels = (df.value.diff().ne(0) | df.id.diff().ne(0)).cumsum()df['flag'] = (labels.map(labels.value_counts()) >= 3).astype(int)

Where ... | df.id.diff().ne(0) increment the label where the id changes

This works even with same value on id change (tested with value 3 on index 10) and takes 1.28ms

EDIT3: Better explanations

Take the case where index 10 has value 3. df.id.diff().ne(0)

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],      'value':[2,2,3,2,2,2,3,3,3,3,3,4,1,1,1,4,4,1,1,1,1,1]}df=pd.DataFrame.from_dict(data)df['id_diff'] = df.id.diff().ne(0).astype(int)df['val_diff'] = df.value.diff().ne(0).astype(int)df['diff_or'] = (df.id.diff().ne(0) | df.value.diff().ne(0)).astype(int)df['labels'] = df['diff_or'].cumsum()     id  value  id_diff  val_diff  diff_or  labels 0    1      2        1         1        1       1 1    1      2        0         0        0       1 2    1      3        0         1        1       2 3    1      2        0         1        1       3 4    1      2        0         0        0       3 5    1      2        0         0        0       3 6    1      3        0         1        1       4 7    1      3        0         0        0       4 8    1      3        0         0        0       4 9    1      3        0         0        0       4>10   2      3        1    |    0    =   1       5 <== label increment 11   2      4        0         1        1       6 12   2      1        0         1        1       7 13   2      1        0         0        0       7 14   2      1        0         0        0       7 15   2      4        0         1        1       8 16   2      4        0         0        0       8 17   2      1        0         1        1       9 18   2      1        0         0        0       9 19   2      1        0         0        0       9 20   2      1        0         0        0       9 21   2      1        0         0        0       9

The | is operator "bitwise-or", which gives True as long as one of the elements is True. So if there is no diff in value where the id changes, the | reflects the id change. Otherwise it changes nothing.When .cumsum() is performed, the label is incremented where the id changes, so the value 3 at index 10 is not grouped with values 3 from indexes 6-9.

python pandas numpy lambda

#try this simpler versiona= pd.Series([1,1,1,2,3,4,5,5,5,7,8,0,0,0])b= a.groupby([a.ne(0), a]).transform('size').ge(3).astype('int')#ge(x) <- x is the number of consecutive repeated values print b

CodeHunter

Identify consecutive same values in Pandas Dataframe, with a Groupby

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last