Comparing previous row values in Pandas DataFrame


You need eq with shift:

df['match'] = df.col1.eq(df.col1.shift())
print (df)
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True

Or use == instead of eq, though it is a bit slower on a large DataFrame:

df['match'] = df.col1 == df.col1.shift()
print (df)
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True
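To see why the first row is always False, it can help to look at the shifted column next to the original. The following is a minimal sketch on the same sample data; the names prev, match_eq and match_op are only illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

# shift() moves every value down by one row; the first row has no
# predecessor, so it becomes NaN
prev = df['col1'].shift()

# NaN never compares equal to anything, so the first row is False
match_eq = df['col1'].eq(prev)
match_op = df['col1'] == prev

# Show the original values, the previous-row values and the result together
print(pd.DataFrame({'col1': df['col1'], 'prev': prev, 'match': match_eq}))
```

Both spellings give the same boolean Series, so the choice between eq and == is mostly a matter of style and the small timing difference shown below.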

Timings:

import pandas as pd

data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print (df)

#[80000 rows x 1 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df['match'] = df.col1 == df.col1.shift()
df['match1'] = df.col1.eq(df.col1.shift())
print (df)

In [208]: %timeit df.col1.eq(df.col1.shift())
The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 933 µs per loop

In [209]: %timeit df.col1 == df.col1.shift()
1000 loops, best of 3: 1 ms per loop


1) pandas approach: Use diff:

df['match'] = df['col1'].diff().eq(0)

2) numpy approach: Use np.ediff1d.

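# np.nan replaces the np.NaN alias, which was removed in NumPy 2.0;
# the float cast lets NaN be prepended to an integer column
df['match'] = np.ediff1d(df['col1'].values.astype(float), to_begin=np.nan) == 0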

Both produce the same match column as the eq/shift approach above.
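As a quick sanity check, the difference-based variants can be compared against the shift-based one on the sample data. This is only a sketch: the variable names are made up for illustration, and the float conversion is an assumption added so that NaN can be prepended to an integer column on recent NumPy versions. Because both variants rely on subtraction, they should be read as applying to numeric columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})

# Reference result: compare each value with the previous row
via_shift = df['col1'].eq(df['col1'].shift())

# pandas: a difference of 0 means the value equals the previous row;
# the first difference is NaN, which is not equal to 0
via_diff = df['col1'].diff().eq(0)

# NumPy: same idea on the raw array, with NaN prepended for the first row
values = df['col1'].to_numpy(dtype=float)
via_ediff = np.ediff1d(values, to_begin=np.nan) == 0

print(via_shift.equals(via_diff))
print(np.array_equal(via_ediff, via_shift.to_numpy()))
```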

Timings: (for the same DF used by @jezrael)

%timeit df.col1.eq(df.col1.shift())
1000 loops, best of 3: 731 µs per loop

%timeit df['col1'].diff().eq(0)
1000 loops, best of 3: 405 µs per loop


Here's a NumPy array-based approach using slicing, which compares two views into the input array and so avoids copying the data -

def comp_prev(a):
    return np.concatenate(([False],a[1:] == a[:-1]))

df['match'] = comp_prev(df.col1.values)

Sample run -

In [48]: df['match'] = comp_prev(df.col1.values)

In [49]: df
Out[49]: 
   col1  match
0     1  False
1     3  False
2     3   True
3     1  False
4     2  False
5     3  False
6     2  False
7     2   True
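Since the slicing approach uses only an equality test, it should also work on non-numeric data, where a difference-based method does not apply. Here is a small self-contained sketch; the string column and the frame name s are made up for illustration.

```python
import numpy as np
import pandas as pd

def comp_prev(a):
    # a[1:] and a[:-1] are views of the same array shifted by one position;
    # the leading False stands in for the first row, which has no predecessor
    return np.concatenate(([False], a[1:] == a[:-1]))

# Hypothetical string data, not taken from the question
s = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c']})
s['match'] = comp_prev(s['col1'].to_numpy())
print(s)
```

One caveat: for an empty input the function still returns a single False, so the result is one element longer than the input and an empty column would need to be handled separately.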

Runtime test -

In [56]: data={'col1':[1,3,3,1,2,3,2,2]}
    ...: df0=pd.DataFrame(data,columns=['col1'])
    ...: 

#@jezrael's soln1
In [57]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [58]: %timeit df['match'] = df.col1 == df.col1.shift()
1000 loops, best of 3: 1.53 ms per loop

#@jezrael's soln2
In [59]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [60]: %timeit df['match'] = df.col1.eq(df.col1.shift())
1000 loops, best of 3: 1.49 ms per loop

#@Nickil Maveli's soln1
In [61]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [64]: %timeit df['match'] = df['col1'].diff().eq(0)
1000 loops, best of 3: 1.02 ms per loop

#@Nickil Maveli's soln2
In [65]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [66]: %timeit df['match'] = np.ediff1d(df['col1'].values, to_begin=np.NaN) == 0
1000 loops, best of 3: 1.52 ms per loop

# Posted approach in this post
In [67]: df = pd.concat([df0]*10000).reset_index(drop=True)

In [68]: %timeit df['match'] = comp_prev(df.col1.values)
1000 loops, best of 3: 376 µs per loop
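The numbers above come from an IPython session and will differ between machines and library versions. To rerun the comparison outside IPython, something like the following sketch with the standard timeit module could be used. The candidates dictionary and the loop counts are arbitrary choices for illustration, and the np.ediff1d variant is left out.

```python
import timeit

import numpy as np
import pandas as pd

df0 = pd.DataFrame({'col1': [1, 3, 3, 1, 2, 3, 2, 2]})
# Same 80000-row frame as in the timings above
df = pd.concat([df0] * 10000).reset_index(drop=True)

def comp_prev(a):
    return np.concatenate(([False], a[1:] == a[:-1]))

# Each candidate only computes the boolean result, without assigning it
candidates = {
    '== shift': lambda: df.col1 == df.col1.shift(),
    'eq shift': lambda: df.col1.eq(df.col1.shift()),
    'diff eq 0': lambda: df['col1'].diff().eq(0),
    'comp_prev': lambda: comp_prev(df.col1.values),
}

# Check that all candidates agree before timing them
results = {name: np.asarray(fn()) for name, fn in candidates.items()}

# Best of 3 repeats, 20 calls each, reported as seconds per call
n_calls = 20
timings = {
    name: min(timeit.repeat(fn, number=n_calls, repeat=3)) / n_calls
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f'{name:10s} {seconds * 1e6:10.1f} µs per loop')
```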