Comparing previous row values in Pandas DataFrame
    df['match'] = df.col1.eq(df.col1.shift())
    print (df)
       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
Alternatively, use == instead of eq, but it is a bit slower on a large DataFrame:
    df['match'] = df.col1 == df.col1.shift()
    print (df)
       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
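As a small illustration of why the first row is always False: shift() moves every value down one position and fills the first slot with NaN, and NaN never compares equal to anything. The sketch below uses the same sample values; the `prev` helper column is only there for display and is not part of the original answer.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 1, 2, 3, 2, 2], name="col1")

prev = s.shift()    # first element becomes NaN, so the dtype changes to float64
match = s.eq(prev)  # element-wise equality; a NaN on either side gives False

# Show the value, its predecessor, and the comparison result side by side.
print(pd.DataFrame({"col1": s, "prev": prev, "match": match}))
```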
Timings:
    import pandas as pd

    data={'col1':[1,3,3,1,2,3,2,2]}
    df=pd.DataFrame(data,columns=['col1'])
    print (df)

    #[80000 rows x 1 columns]
    df = pd.concat([df]*10000).reset_index(drop=True)
    df['match'] = df.col1 == df.col1.shift()
    df['match1'] = df.col1.eq(df.col1.shift())
    print (df)

    In [208]: %timeit df.col1.eq(df.col1.shift())
    The slowest run took 4.83 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000 loops, best of 3: 933 µs per loop

    In [209]: %timeit df.col1 == df.col1.shift()
    1000 loops, best of 3: 1 ms per loop
1) pandas approach: Use diff:

    df['match'] = df['col1'].diff().eq(0)
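2) numpy approach: Use np.ediff1d.

    # np.NaN was removed in NumPy 2.0, so use np.nan. Recent NumPy versions also
    # require to_begin to be castable to the array's dtype, hence the float cast.
    df['match'] = np.ediff1d(df['col1'].values.astype(float), to_begin=np.nan) == 0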
Both produce:

       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
Timings (for the same DF used by @jezrael):
    %timeit df.col1.eq(df.col1.shift())
    1000 loops, best of 3: 731 µs per loop

    %timeit df['col1'].diff().eq(0)
    1000 loops, best of 3: 405 µs per loop
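For readers unfamiliar with diff, here is a minimal sketch of what it computes on the sample data: each value minus the one before it, so a zero difference marks a repeated value. Because it relies on subtraction, this approach is meant for numeric columns. The variable names are chosen for illustration only.

```python
import pandas as pd

s = pd.Series([1, 3, 3, 1, 2, 3, 2, 2], name="col1")

d = s.diff()              # same as s - s.shift(); the first element is NaN
manual = s - s.shift()    # spelled out by hand for comparison

match = d.eq(0)           # NaN is not equal to 0, so the first row is False

print(pd.DataFrame({"col1": s, "diff": d, "match": match}))
```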
Here's a NumPy array based approach using slicing, which lets us work with views into the input array for efficiency -
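    import numpy as np

    def comp_prev(a):
        # Compare each element with its predecessor; the first has none, so it is False.
        return np.concatenate(([False], a[1:] == a[:-1]))

    df['match'] = comp_prev(df.col1.values)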
Sample run -
    In [48]: df['match'] = comp_prev(df.col1.values)

    In [49]: df
    Out[49]:
       col1  match
    0     1  False
    1     3  False
    2     3   True
    3     1  False
    4     2  False
    5     3  False
    6     2  False
    7     2   True
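The efficiency claim about views can be checked directly. The sketch below is only an illustration, not part of the original answer: it uses np.shares_memory to confirm that the two slices reuse the input buffer, so the only new allocations are the comparison result and the concatenated output.

```python
import numpy as np

a = np.array([1, 3, 3, 1, 2, 3, 2, 2])

# Basic slices are views: no data is copied here.
cur, prev = a[1:], a[:-1]
print(np.shares_memory(cur, a), np.shares_memory(prev, a))

# The element-wise comparison allocates a boolean array of length len(a) - 1,
# and concatenate prepends False for the first row, which has no predecessor.
match = np.concatenate(([False], cur == prev))
print(match)
```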
Runtime test -
    In [56]: data={'col1':[1,3,3,1,2,3,2,2]}
        ...: df0=pd.DataFrame(data,columns=['col1'])
        ...:

    #@jezrael's soln1
    In [57]: df = pd.concat([df0]*10000).reset_index(drop=True)
    In [58]: %timeit df['match'] = df.col1 == df.col1.shift()
    1000 loops, best of 3: 1.53 ms per loop

    #@jezrael's soln2
    In [59]: df = pd.concat([df0]*10000).reset_index(drop=True)
    In [60]: %timeit df['match'] = df.col1.eq(df.col1.shift())
    1000 loops, best of 3: 1.49 ms per loop

    #@Nickil Maveli's soln1
    In [61]: df = pd.concat([df0]*10000).reset_index(drop=True)
    In [64]: %timeit df['match'] = df['col1'].diff().eq(0)
    1000 loops, best of 3: 1.02 ms per loop

    #@Nickil Maveli's soln2
    In [65]: df = pd.concat([df0]*10000).reset_index(drop=True)
    In [66]: %timeit df['match'] = np.ediff1d(df['col1'].values, to_begin=np.NaN) == 0
    1000 loops, best of 3: 1.52 ms per loop

    # Posted approach in this post
    In [67]: df = pd.concat([df0]*10000).reset_index(drop=True)
    In [68]: %timeit df['match'] = comp_prev(df.col1.values)
    1000 loops, best of 3: 376 µs per loop
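To rerun this comparison outside IPython, something along these lines should work. It is a rough sketch using the standard timeit module rather than %timeit; the `candidates` dictionary and its labels are made up for illustration, and the absolute numbers will depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def comp_prev(a):
    # First element has no predecessor, so it is marked False.
    return np.concatenate(([False], a[1:] == a[:-1]))


df0 = pd.DataFrame({"col1": [1, 3, 3, 1, 2, 3, 2, 2]})
df = pd.concat([df0] * 10000).reset_index(drop=True)  # 80000 rows

candidates = {
    "== shift": lambda: df.col1 == df.col1.shift(),
    "eq shift": lambda: df.col1.eq(df.col1.shift()),
    "diff eq 0": lambda: df["col1"].diff().eq(0),
    "comp_prev": lambda: comp_prev(df.col1.values),
}

# Use one approach as the reference and check that the others agree with it.
reference = np.asarray(candidates["eq shift"]())

for name, fn in candidates.items():
    assert np.array_equal(np.asarray(fn()), reference)
    # Best of 3 repeats, 100 calls each, reported per call.
    best = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{name:>10}: {best * 1e6:.0f} µs per call")
```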