How to remove a pandas dataframe from another dataframe


Solution

Use pd.concat followed by drop_duplicates(keep=False)

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

The result looks like this:

   a  b
1  3  4

Explanation

pd.concat stacks the DataFrames by appending the rows of one right after the other. Any row that appears more than once in the combined result is then caught by drop_duplicates. By default, however, drop_duplicates keeps the first occurrence and removes only the later ones. Here we want every copy of a duplicated row removed, which is exactly what keep=False does.

A special note on the repeated df2: with only one copy of df2, any row of df2 that is not in df1 appears just once, so it is not treated as a duplicate and stays in the result. The single-df2 version therefore only works when df2 is a subset of df1. If we concatenate df2 twice, every row of df2 is guaranteed to be a duplicate and is removed.
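The effect of the repeated df2 can be seen in a minimal sketch. The sample frames below are assumptions made up for illustration: df2 shares one row with df1 and has one row of its own.

```python
import pandas as pd

# Hypothetical sample data: (1, 2) is in both frames, (5, 6) is only in df2.
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=["a", "b"])

# One copy of df2: (5, 6) occurs only once, so it survives the de-duplication.
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Two copies of df2: every row of df2 is now a duplicate and is dropped.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(once)
print(twice)
```

One caveat worth keeping in mind: because keep=False drops every copy of a repeated row, rows that are duplicated inside df1 itself are removed as well.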


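You can use .duplicated on the index, which has the benefit of being fairly expressive. Note that this compares index labels, not row values, so it assumes the rows of df2 carry the same index labels they have in df1:

%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 µs per loop

DataFrame.append was removed in pandas 2.0; on current versions, pd.concat([df1, df2]) does the same job.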
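As a rough sketch of the index-based approach, written with pd.concat for the stacking step: the sample data is an assumption, and df2 is deliberately sliced out of df1 so that it keeps df1's index label.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])

# df2 is taken from df1, so its single row keeps the index label 0.
df2 = df1.iloc[[0]]

# Stack the frames, then keep only rows whose index label occurs exactly once.
combined = pd.concat([df1, df2])
result = combined[~combined.index.duplicated(keep=False)]

print(result)
```

If df2 were built independently (for example with a fresh default index), the labels would no longer identify matching rows, and a value-based method would be the safer choice.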

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop

%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop

%timeit df2[df2.apply(lambda x: x.value not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop

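In sum, the np.array comparison is the fastest of the three on this data, and the .tolist() conversion is not needed there. Keep in mind that `in` on a NumPy array tests element-wise equality, so it is a looser check than a strict whole-row match.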
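The merge-based variant in the timings above can be read as an anti-join: a left merge with indicator=True labels each row of df1 by where it was found, and only the 'left_only' rows are kept. A minimal sketch, with sample data assumed for illustration, that filters the merged frame directly:

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2]], columns=["a", "b"])

# Left merge on all columns; the added '_merge' column says whether each
# row was found in both frames or only in df1.
merged = df1.merge(df2, on=["a", "b"], how="left", indicator=True)

# Keep the rows that exist only in df1 and drop the helper column.
result = merged.loc[merged["_merge"] == "left_only", ["a", "b"]]

print(result)
```

Filtering the merged frame, rather than indexing df1 with the mask, avoids depending on df1 having a default integer index.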


A set logic approach: turn the rows of df1 and df2 into sets of tuples, then use set subtraction to build a new DataFrame. The original index of df1 is not preserved.

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4
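A possible variation on the same idea, not the method shown above, is to test row membership with MultiIndex.isin instead of building Python sets. This sketch assumes the same two-column sample data and keeps df1's original index and row order.

```python
import pandas as pd

# Hypothetical sample data.
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2]], columns=["a", "b"])

cols = ["a", "b"]

# Represent each row as one entry of a MultiIndex (effectively a tuple).
idx1 = pd.MultiIndex.from_frame(df1[cols])
idx2 = pd.MultiIndex.from_frame(df2[cols])

# Keep the rows of df1 whose (a, b) tuple does not occur in df2.
result = df1[~idx1.isin(idx2)]

print(result)
```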