Anti-Join Pandas Anti-Join Pandas pandas pandas

Anti-Join Pandas


indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

  • left_only
  • right_only
  • both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)

easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

So it's up for you to decide.

enter image description here


Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('abcd'), name='Key'),                      ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('aecf'), name='Key'),                      ['A', 'B', 'C']).reset_index()

TableA

enter image description here


TableB

enter image description here

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableAkey_diff = set(TableB.Key).difference(TableA.Key)where_diff = TableB.Key.isin(key_diff)# Slice TableB accordingly and append to TableATableA.append(TableB[where_diff], ignore_index=True)

enter image description here

Method 2

rows = []for i, row in TableB.iterrows():    if row.Key not in TableA.Key.values:        rows.append(row)pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

enter image description here

10,000 rows 5,000 overlap

loops are bad

enter image description here


I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pdimport numpy as npTableA = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('abcd'), name='Key'),                      ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('aecf'), name='Key'),                      ['A', 'B', 'C']).reset_index()print('TableA', TableA, sep='\n')print('TableB', TableB, sep='\n')TableB_only = pd.merge(    TableA, TableB,    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(        '_merge == "right_only"')print('TableB_only', TableB_only, sep='\n')Table_concatenated = pd.concat((TableA, TableB_only), join='inner')print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA  Key         A         B         C0   a  0.035548  0.344711  0.8609181   b  0.640194  0.212250  0.2773592   c  0.592234  0.113492  0.0374443   d  0.112271  0.205245  0.227157TableB  Key         A         B         C0   a  0.754538  0.692902  0.5377041   e  0.499092  0.864145  0.0045592   c  0.082087  0.682573  0.4216543   f  0.768914  0.281617  0.924693TableB_only  Key  A_foo  B_foo  C_foo         A         B         C      _merge4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_onlyTable_concatenated  Key         A         B         C0   a  0.035548  0.344711  0.8609181   b  0.640194  0.212250  0.2773592   c  0.592234  0.113492  0.0374443   d  0.112271  0.205245  0.2271574   e  0.499092  0.864145  0.0045595   f  0.768914  0.281617  0.924693